Abstract

Overcomplete  latent  representations  have  been  very  popular  for  unsupervised  feature  learning 
in  recent  years.  In  this  paper,  we  specify  which  overcomplete  models  can  be  identified  given 
observable  moments  of  a  certain  order.  We  consider  probabilistic  admixture  or  topic  models 
in  the  overcomplete  regime,  where  the  number  of  latent  topics  can  greatly  exceed  the  size  of 
the  observed  word  vocabulary.  While  general  overcomplete  topic  models  are  not  identifiable,  we 
establish  generic  identifiability  under  a  constraint,  referred  to  as  topic  persistence.  Our  sufficient 
conditions  for  identifiability  involve  a  novel  set  of  “higher  order”  expansion  conditions  on  the 
topic-word matrix or the population structure of the model. This set of higher-order expansion
conditions allows for overcomplete models, and requires the existence of a perfect matching from
latent topics to higher order observed words. We establish that random structured topic models
are identifiable with high probability in the overcomplete regime. Our identifiability results allow for general
(non-degenerate)  distributions  for  modeling  the  topic  proportions,  and  thus,  we  can  handle 
arbitrarily  correlated  topics  in  our  framework.  Our  identifiability  results  imply  uniqueness  of  a 
class  of  tensor  decompositions  with  structured  sparsity  which  is  contained  in  the  class  of  Tucker 
decompositions,  but  is  more  general  than  the  Candecomp/Parafac  (CP)  decomposition. 

Keywords: Overcomplete representations, topic models, generic identifiability, tensor decomposition.


1  Introduction 


The  performance  of  many  machine  learning  methods  is  hugely  dependent  on  the  choice  of  data 
representations  or  features.  Overcomplete  representations,  where  the  number  of  features  can  be 
greater than the dimensionality of the input data, have been extensively employed, and are arguably
critical in a number of applications such as speech and computer vision [1]. Overcomplete
representations are known to be more robust to noise, and can provide greater flexibility in modeling
[2]. Unsupervised estimation of overcomplete representations has been hugely popular due to
the  availability  of  large-scale  unlabeled  samples  in  many  applications. 

A probabilistic framework for incorporating features posits latent or hidden variables that can provide
a good explanation of the observed data. Overcomplete probabilistic models can incorporate
a  much  larger  number  of  latent  variables  compared  to  the  observed  dimensionality.  In  this  paper, 
we  characterize  the  conditions  under  which  overcomplete  latent  variable  models  can  be  identified 
from  their  observed  moments. 

For  any  parametric  statistical  model,  identifiability  is  a  fundamental  question  of  whether  the  model 
parameters  can  be  uniquely  recovered  given  the  observed  statistics.  Identifiability  is  crucial  in  a 
number of applications where the latent variables are the quantities of interest, e.g., inferring diseases
(latent variables) through symptoms (observations), inferring communities (latent variables)
via the interactions among the actors in a social network (observations), and so on. Moreover,
identifiability can be relevant even in predictive settings, where feature learning is employed for
some higher level task such as classification. For instance, non-identifiability can lead to the presence
of non-isolated local optima for optimization-based learning methods, and this can affect their
convergence properties; see, e.g., [3].

In this paper, we characterize identifiability for a popular class of latent variable models, known
as the admixture or topic models [?]. These are hierarchical mixture models, which incorporate
the presence of multiple latent states (i.e., topics) in each document consisting of a tuple
of observed variables (i.e., words). Previous works have established that the model parameters
can be estimated efficiently using low order observed moments (second and third order) under
some non-degeneracy assumptions, e.g., [?]. However, these non-degeneracy conditions imply
that the model is undercomplete, i.e., the latent dimensionality (number of topics) cannot exceed
the  observed  dimensionality  (word  vocabulary  size).  In  this  paper,  we  remove  this  restriction  and 
consider  overcomplete  topic  models,  where  the  number  of  topics  can  far  exceed  the  word  vocabulary 
size. 

It is perhaps not surprising that general topic models are not identifiable in the overcomplete
regime. To overcome this, we introduce an additional constraint on the model, referred to as topic
persistence.  Intuitively,  this  captures  the  “locality”  effect  among  the  observed  words,  and  is  not 
present  in  the  usual  “bag-of-words”  or  exchangeable  topic  model.  Such  local  dependencies  among 
observations  abound  in  applications  such  as  text,  images  and  speech,  and  can  lead  to  a  more  faithful 
representation.  In  addition,  we  establish  that  the  presence  of  topic  persistence  is  central  towards 
obtaining  model  identifiability  in  the  overcomplete  regime,  and  we  provide  an  in-depth  analysis  of 
this  phenomenon  in  this  paper. 


1.1  Summary  of  results 

In this paper, we provide conditions for generic¹ model identifiability of overcomplete topic models
given observable moments of a certain order (i.e., having a certain number of words in each
document). We introduce the notion of topic persistence, and analyze its effect on identifiability. We
establish identifiability in the presence of a novel combinatorial object, referred to as perfect n-gram
matching, in the bipartite graph from topics to words. Finally, we prove that random structured
topic models satisfy these criteria, and are thus identifiable in the overcomplete regime.

¹ A model is generically identifiable if all the parameters in the parameter space are identifiable, almost surely.
Refer to Definition 1 for more discussion.

Figure 1: Hierarchical structure of the n-persistent topic model. $2rn$ words (views) are
shown for some integer $r \ge 1$. A single topic $y_j$, $j \in [2r]$, is chosen for each $n$ successive views
$\{x_{(j-1)n+1}, \ldots, x_{(j-1)n+n}\}$. Matrix $A$ is the population structure or topic-word matrix.


Persistent Topic Model: We first introduce the n-persistent topic model, where the parameter
n determines the persistence level of a common topic in a sequence of n successive words. For
instance, in Figure 1, the sequence of successive words $x_1, \ldots, x_n$ share a common topic $y_1$, and
similarly, the words $x_{n+1}, \ldots, x_{2n}$ share topic $y_2$, and so on. The n-persistent model reduces to the
popular “bag-of-words” model when $n = 1$, and to the single topic model (i.e., only one topic in each
document) when $n \to \infty$. Intuitively, topic persistence aids identifiability since we have multiple
views of the common hidden topic generating a sequence of successive words. We establish that
the bag-of-words model (with $n = 1$) is too non-informative about the topics in the overcomplete
regime, and is therefore not identifiable. On the other hand, n-persistent overcomplete topic
models with $n \ge 2$ can become identifiable, and we establish a set of transparent conditions for
identifiability.
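
The block structure of the n-persistent model can be illustrated with a minimal Python sketch that samples one document. It assumes that the topic proportions h are given, that each block's topic is drawn independently from h, and that words are drawn from the corresponding column of the topic-word matrix. The function name `sample_persistent_document` and its arguments are hypothetical, introduced here only for illustration.

```python
import random


def sample_persistent_document(A_cols, h, n, num_blocks, rng):
    """Sample topics and words from a sketch of the n-persistent topic model.

    A_cols[j][w]: probability of word w under topic j (column j of A).
    h: topic proportions, non-negative and summing to 1 (assumed given).
    n: persistence level, i.e. the number of successive words per topic.
    num_blocks: number of topic draws (2r in Figure 1).
    """
    q, p = len(A_cols), len(A_cols[0])
    topics, words = [], []
    for _ in range(num_blocks):
        # One latent topic is drawn per block of n successive words.
        y = rng.choices(range(q), weights=h, k=1)[0]
        topics.append(y)
        # All n words of the block are views of the same topic y.
        words.extend(rng.choices(range(p), weights=A_cols[y], k=n))
    return topics, words


# Illustrative overcomplete instance: q = 4 topics over p = 3 words.
A_cols = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [1.0, 0.0, 0.0]]
h = [0.25, 0.25, 0.25, 0.25]
topics, words = sample_persistent_document(A_cols, h, n=2, num_blocks=4,
                                           rng=random.Random(0))
```

With n = 1 every word gets its own topic draw (bag-of-words), while n equal to the document length yields a single topic for the whole document.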


Deterministic  Conditions  for  Identifiability:  Our  sufficient  conditions  for  identifiability  are 
in  the  form  of  expansion  conditions  from  the  latent  topic  space  to  the  observed  word  space.  In  the 
overcomplete  regime,  there  are  more  topics  than  words  in  the  vocabulary,  and  thus  it  is  impossible 
to  have  expansion  on  the  bipartite  graph  from  topics  to  words,  i.e.,  the  graph  encoding  the  sparsity 
pattern  of  the  topic-word  matrix.  Instead,  we  impose  an  expansion  constraint  from  topics  to 
“higher  order”  words,  which  allows  us  to  incorporate  overcomplete  models.  We  establish  that 
this  condition  translates  to  the  presence  of  a  novel  combinatorial  object,  referred  to  as  the  perfect 
n-gram matching, on the topic-word bipartite graph. Intuitively, the perfect n-gram matching
condition  implies  “diversity”  among  the  higher-order  word  supports  for  different  topics  which  leads 
to  identifiability.  In  addition,  we  present  trade-offs  among  the  following  quantities:  number  of 
topics,  size  of  the  word  vocabulary,  the  topic  persistence  level,  the  order  of  the  observed  moments 
at  hand,  the  minimum  and  maximum  degrees  of  any  topic  in  the  topic-word  bipartite  graph, 
and the Kruskal rank [9] of the topic-word matrix, under which identifiability holds. To the best
of our knowledge, this is the first work to provide conditions for characterizing identifiability of
overcomplete topic models with structured sparsity.


Identifiability of Random Structured Topic Models: We explicitly characterize the regime
of identifiability for the random setting, where each topic $i$ is randomly supported on a set of $d_i$
words, i.e., the bipartite graph is a random graph. For this random model with $q$ topics, a $p$-dimensional
word vocabulary, and topic persistence level $n$, when $q = O(p^n)$ and $\Theta(\log p) \le d_i \le
\Theta(p^{1/n})$ for all topics $i$, the topic-word matrix is identifiable from $(2n)$th-order observed moments
with  high  probability.  Intuitively,  the  upper  bound  on  the  degrees  di  is  needed  to  limit  the  overlap 
of  word  supports  among  different  topics  in  the  overcomplete  regime:  as  the  number  of  topics  q 
increases  (i.e.,  n  increases  in  the  above  degree  bound),  the  degree  needs  to  be  correspondingly 
smaller to ensure identifiability, and we make this dependence explicit. In other words, as the extent of
overcompleteness increases, we need sparser connections from topics to words to ensure sufficient
diversity in the word supports among different topics. The lower bound on the degrees is required
so that the topic-word bipartite graph has enough edges for the various topics to be
distinguished from one another. Furthermore, we establish that the size condition $q = O(p^n)$ for
identifiability is tight.


Implications  on  Uniqueness  of  Overcomplete  Tucker  and  CP  Tensor  Decompositions: 

We  establish  that  identifiability  of  an  overcomplete  topic  model  is  equivalent  to  uniqueness  of 
decomposition  of  the  observed  moment  tensor  (of  a  certain  order).  Our  identifiability  results  for 
persistent  topic  models  imply  uniqueness  of  a  structured  class  of  tensor  decompositions,  which  is 
contained in the class of Tucker decompositions, but is more general than the candecomp/parafac
(CP) decomposition [10]. This sub-class of Tucker decompositions involves structured sparsity and
symmetry constraints on the core tensor, and sparsity constraints on the inverse factors of the
Tucker decomposition. The structural constraints on the Tucker tensor decomposition are related
to the topic model as follows: the sparsity and symmetry constraints on the core tensor are related
to the persistence property of the topic model, and the sparsity constraints on the inverse factors are
equivalent to the sparsity constraints on the topic-word matrix. For the n-persistent topic model with
$n = 1$ (bag-of-words model), the tensor decomposition is a general Tucker decomposition, where the
core tensor is fully dense, while for $n \to \infty$ (single-topic model), the tensor decomposition reduces
to a CP decomposition, i.e., the core tensor is a diagonal tensor. For a finite persistence level n,
in between these two extremes, the core tensor satisfies certain sparsity and symmetry constraints,
which become crucial to establishing identifiability in the overcomplete regime.
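
For orientation, the two extremes can be written out for a generic third-order tensor. This is the textbook form of the Tucker and CP decompositions, not the specific moment tensor analyzed in this paper. The symbols $T$, $S$, $u^{(m)}_i$, $\lambda_i$ and the choice of order 3 are assumptions made here for illustration.

```latex
% Generic third-order Tucker decomposition (textbook form, illustrative symbols):
% S is the core tensor, u^{(m)}_i is the i-th column of the m-th factor matrix,
% and \circ denotes the vector outer product.
T = \sum_{i=1}^{q} \sum_{j=1}^{q} \sum_{k=1}^{q}
      S_{ijk} \, u^{(1)}_i \circ u^{(2)}_j \circ u^{(3)}_k

% CP decomposition: the special case of a diagonal core tensor,
% S_{ijk} = \lambda_i if i = j = k and S_{ijk} = 0 otherwise.
T = \sum_{i=1}^{q} \lambda_i \, u^{(1)}_i \circ u^{(2)}_i \circ u^{(3)}_i
```

A fully dense core $S$ corresponds to the $n = 1$ extreme described above and a diagonal core to the $n \to \infty$ extreme; a finite persistence level sits between the two, with a core that is sparse and symmetric but not diagonal.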


1.2  Overview  of  Techniques 

We  now  provide  a  short  overview  of  the  techniques  employed  in  this  paper. 


Recap of Identifiability Conditions in Under-complete Setting (Expansion Conditions
on Topic-Word Matrix): Our approach is based on the recent results of [7], where conditions
for identifiability of topic models are derived, given pairwise observed moments (specifically,
co-occurrence of word-pairs in documents). Consider a topic model with $q$ topics and observed word
vocabulary of size $p$. Let $A \in \mathbb{R}^{p \times q}$ denote the topic-word matrix. Expansion conditions are imposed
in [7] on the topic-word bipartite graph which imply that (generically) the sparsest vectors in the
column span of $A$, denoted by Col($A$), are the columns of $A$ themselves. Thus the topic-word matrix
$A$ is identifiable from pairwise moments under expansion constraints. However, these expansion
conditions constrain the model to be under-complete, i.e., the number of topics $q < p$, the size of
the  word  vocabulary.  Therefore,  the  techniques  derived  in  [7]  are  not  directly  applicable  here  since 
we  consider  overcomplete  models. 


Identifiability in Overcomplete Setting and Why Topic-Persistence Helps: Pairwise
moments are thus not sufficient for identifiability of overcomplete models, and the question is
whether higher order moments can yield identifiability. We can view the higher order moments
as pairwise moments of another equivalent topic model, which enables us to apply the techniques
of [7]. The key question is whether we have expansion in the equivalent topic model, which implies
identifiability. For a general topic model (without any topic persistence constraints), it can be
shown that for identifiability, we require expansion of the nth-order Kronecker product of the original
topic-word matrix $A$, denoted by $A^{\otimes n} \in \mathbb{R}^{p^n \times q^n}$, when given access to $(2n)$th-order moments, for
any integer $n \ge 1$. In the overcomplete regime where $q > p$, $A^{\otimes n}$ cannot expand, and therefore,
overcomplete models are not identifiable in general. On the other hand, we show that imposing
the constraint of topic persistence can lead to identifiability. For an n-persistent topic model, given
$(2n)$th-order moments, we establish that identifiability occurs when the nth-order Khatri-Rao product
of $A$, denoted by $A^{\odot n} \in \mathbb{R}^{p^n \times q}$, expands. Note that the Khatri-Rao product $A^{\odot n}$ is a sub-matrix of
the Kronecker product $A^{\otimes n}$, and the Khatri-Rao product $A^{\odot n}$ can expand as long as $q < p^n$. Thus,
the property of topic persistence is central to achieving identifiability in the overcomplete
regime.


First-Order Approach for Identifiability of Overcomplete Models (Expansion of n-gram
Topic-Word Matrix): We refer to $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ as the n-gram topic-word matrix, and
intuitively, it relates topics to n-tuples of words. Imposing the expansion conditions derived in [7] on
$A^{\odot n}$ implies that (generically) the sparsest vectors in Col($A^{\odot n}$) are the columns of $A^{\odot n}$ themselves.
Thus, the topic-word matrix $A$ is identifiable from $(2n)$th-order moments for an n-persistent topic
model. We refer to this as the “first-order” approach since we directly impose the expansion
conditions of [7] on $A^{\odot n}$, without exploiting the additional structure present in $A^{\odot n}$.


Why the First-Order Approach is not Enough: Note that the matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ relates
topics to n-tuples of words. Thus, the entries of $A^{\odot n}$ are highly correlated, even if the original
topic-word matrix $A$ is assumed to be randomly generated. It is non-trivial to derive conditions
on $A$ so that $A^{\odot n}$ expands. Moreover, we establish that $A^{\odot n}$ fails to expand on “small” sets, as
required in [7], when the degrees are sufficiently different². Thus, the first-order approach is highly
restrictive in the overcomplete setting.

² For $A^{\odot n}$ to expand on a set of size $s \ge 2$, it is necessary that $s \binom{d_{\min}+n-1}{n} \ge s + \binom{d_{\max}+n-1}{n}$, where $d_{\min}$ and
$d_{\max}$ are the minimum and maximum degrees, and n is the extent of overcompleteness: $q = \Theta(p^n)$. When the model
is highly overcomplete (large n) and we require small set expansion (small s), the degrees need to be nearly the same.
Thus, it is desirable to impose expansion only on large sets, since it allows for more degree diversity.


Incorporating Rank Criterion: Note that $A^{\odot n}$ is highly structured: the columns of the matrix $A^{\odot n}$
possess a tensor³ rank of 1 when $n > 1$. This can be incorporated in our identifiability
criteria as follows: we provide conditions under which the sparsest vectors in Col($A^{\odot n}$) which also
possess a tensor rank of 1 are the columns of $A^{\odot n}$ themselves. This implies identifiability of an
n-persistent topic model, when given access to $(2n)$th-order moments. Note that when a small number
of columns of $A^{\odot n}$ are combined, the resulting vector cannot possess a tensor rank of 1, and thus
we can rule out such sparse combinations of columns using the rank criterion. The maximum
such number is at least the Kruskal rank⁴ of $A$. Thus, sparse combinations of columns of $A$ (up to
the Kruskal rank) can be ruled out using the rank criterion, and we require expansion of $A^{\odot n}$ only
on large sets of topics (of size larger than the Kruskal rank). This agrees with the intuition that
when the topic-word matrix $A$ has a larger Kruskal rank, it should be easier to identify $A$, since
the Kruskal rank is related to the mutual incoherence⁵ among the columns of $A$; see [11].


Notion of Perfect n-gram Matching and Final Identifiability Conditions: Thus, we
establish identifiability of overcomplete topic models subject to expansion conditions on $A^{\odot n}$ on sets
of size larger than the Kruskal rank of the topic-word matrix $A$. However, it is desirable to impose
transparent and interpretable conditions directly on $A$ for identifiability. We introduce the notion
of perfect n-gram matching on the topic-word bipartite graph, which ensures that each topic can
be uniquely matched to an n-tuple of words. This combined with a lower bound on the Kruskal rank
provides the final set of deterministic conditions for identifiability of the overcomplete topic model.
Intuitively, we require that the columns of $A$ be sparse, while still maintaining a large enough
Kruskal rank; in other words, the topics have to be sparse and have sufficiently diverse word
supports. Thus, we establish identifiability under a set of transparent conditions on the topic-word
matrix $A$, consisting of the perfect n-gram matching condition and a lower bound on the Kruskal rank
of $A$.
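
To make the matching idea concrete, here is a minimal Python sketch. It works under a simplifying assumption taken from the informal description above: a perfect n-gram matching assigns each topic a set of n distinct words from its own neighborhood, with no two topics receiving the same set. The formal definition is given later in the paper and may differ in details. The function name `perfect_ngram_matching` is hypothetical, and the search uses a standard augmenting-path bipartite matching between topics and candidate n-subsets, which is not the greedy or recursive construction analyzed in the paper.

```python
from itertools import combinations


def perfect_ngram_matching(neighbors, n):
    """Match every topic to a distinct n-subset of its neighboring words.

    neighbors[j] is the set of words adjacent to topic j in the
    topic-word bipartite graph. Returns {topic: n-subset} or None.
    """
    candidates = [list(combinations(sorted(nb), n)) for nb in neighbors]
    owner = {}  # n-subset of words -> topic currently matched to it

    def try_assign(j, seen):
        # Augmenting-path step: take a free n-subset, or evict its owner
        # if that owner can be re-matched elsewhere.
        for gram in candidates[j]:
            if gram in seen:
                continue
            seen.add(gram)
            if gram not in owner or try_assign(owner[gram], seen):
                owner[gram] = j
                return True
        return False

    for j in range(len(neighbors)):
        if not try_assign(j, set()):
            return None
    return {j: gram for gram, j in owner.items()}


# Overcomplete toy graph: q = 5 topics over p = 4 words.
neighbors = [{0, 1}, {0, 1, 2}, {1, 2}, {2, 3}, {0, 3}]
matching_2gram = perfect_ngram_matching(neighbors, 2)
matching_1gram = perfect_ngram_matching(neighbors, 1)
```

In this toy graph an ordinary matching (n = 1) cannot exist because there are more topics than words, whereas a matching into pairs of words (n = 2) does.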


Analysis under Random-Structured Topic-Word Matrices: Finally, we establish that
the derived deterministic conditions are satisfied when the topic-word bipartite graph is randomly
generated, as long as the degrees satisfy certain lower and upper bounds. Intuitively, a lower
bound on the degrees of the topics is required to have degree concentration on various subsets
so that expansion can occur, while the upper bound is required so that the Kruskal rank of the
topic-word matrix is large enough compared to the sparsity level. Here, the main technical result
is establishing the presence of a perfect n-gram matching in a random bipartite graph with a wide
range of degrees. We present a greedy and a recursive mechanism for constructing such an n-gram
matching  for  overcomplete  models,  which  can  be  relevant  even  in  other  settings.  For  instance,  our 
results  imply  the  presence  of  a  perfect  matching  when  the  edges  of  a  bipartite  graph  are  correlated 
in  a  structured  manner,  as  given  by  the  Khatri-Rao  product. 

³ When any column of $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ (of length $p^n$) is reshaped as an nth-order tensor $T \in \mathbb{R}^{p \times p \times \cdots \times p}$, the tensor $T$
is rank 1.

⁴ The Kruskal rank is the maximum number $k$ such that every $k$-subset of columns of $A$ is linearly independent.
Note that the Kruskal rank is equal to the rank of $A$ when $A$ has full column rank. But this cannot happen in the
overcomplete setting.

⁵ It is easy to show that krank $\ge (\max_{i \ne j} |a_i^\top a_j|)^{-1}$, where $a_i, a_j$ are any pair of columns of $A$. Thus, higher
incoherence leads to a larger Kruskal rank.
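
The definition of the Kruskal rank in the footnote above can be checked directly on small examples. The following brute-force Python sketch is exponential in the number of columns and is meant only as an illustration. The helper names `matrix_rank` and `kruskal_rank` are hypothetical, and exact rational arithmetic is used to avoid numerical tolerance issues.

```python
from fractions import Fraction
from itertools import combinations


def matrix_rank(vectors):
    """Rank of a list of vectors, by exact Gaussian elimination."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    rank = 0
    for c in range(len(rows[0]) if rows else 0):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(rank + 1, len(rows)):
            factor = rows[i][c] / rows[rank][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[rank])]
        rank += 1
    return rank


def kruskal_rank(columns):
    """Largest k such that every k-subset of columns is linearly independent."""
    k = 0
    for size in range(1, len(columns) + 1):
        if all(matrix_rank(list(s)) == size for s in combinations(columns, size)):
            k = size
        else:
            break
    return k


# Overcomplete example: q = 3 columns in dimension p = 2.
cols_diverse = [(1, 0), (0, 1), (1, 1)]    # every pair is independent
cols_parallel = [(1, 0), (2, 0), (0, 1)]   # first two columns are parallel
krank_diverse = kruskal_rank(cols_diverse)    # 2
krank_parallel = kruskal_rank(cols_parallel)  # 1, although the rank is 2
```

The second example shows how the Kruskal rank can be strictly smaller than the ordinary rank when some columns are nearly or exactly aligned.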
1.3 Related works


We  now  summarize  some  recent  related  works  in  the  area  of  identifiability  and  learning  of  latent 
variable  models. 


Identifiability,  learning  and  applications  of  overcomplete  latent  representations:  Many 
recent works employ unsupervised estimation of overcomplete features for higher level tasks such
as classification, e.g., [?], and record huge gains over other approaches in a number of applications
such  as  speech  recognition  and  computer  vision.  However,  theoretical  understanding  regarding 
learnability  or  identifiability  of  overcomplete  representations  is  far  more  limited. 

Overcomplete latent representations have been analyzed in the context of independent component
analysis (ICA), where the sources are assumed to be independent, and the mixing matrix
is unknown. In the overcomplete or under-determined regime of ICA, there are more sources
than sensors. Identifiability and learning of overcomplete ICA reduces to the problem of finding
an overcomplete candecomp/parafac (CP) tensor decomposition. The classical result by Kruskal
provides conditions for uniqueness of a CP decomposition [?], with recent extensions to the
notion of robust identifiability [?]. These results provide conditions for strict identifiability of the
model, and here, the dimensionality of the latent space is required to be of the same order as the
observed space dimensionality. In contrast, a number of recent works analyze generic identifiability
of overcomplete CP decomposition, which is weaker than strict identifiability, e.g., [?]. These
works assume that the factors (i.e., the components) of the CP decomposition are generically drawn
and provide conditions for uniqueness. They allow for the latent dimensionality to be much larger
(polynomially larger) than the observed dimensionality. These results on the uniqueness of CP
decompositions also lead to identifiability of other latent variable models, such as latent tree models,
e.g., [24, 25], and the single-topic model, or more generally latent Dirichlet allocation (LDA).
Recently, Goyal et al. [26] proposed an alternative framework for overcomplete ICA models based
on  the  eigen-decomposition  of  the  reweighted  covariance  matrix  (or  higher  order  moments),  where 
the  weights  are  the  Fourier  coefficients.  However,  their  approach  requires  independence  of  sources 
(i.e.  latent  topics  in  our  context),  which  is  not  imposed  here. 

In  contrast  to  the  above  works  dealing  with  the  CP  tensor  decomposition,  we  require  uniqueness  for 
a  more  general  class  of  tensor  decompositions,  in  order  to  establish  identifiability  of  topic  models 
with  arbitrarily  correlated  topics.  We  establish  that  our  class  of  tensor  decomposition  is  contained 
in  the  class  of  Tucker  decompositions  which  is  more  general  than  CP  decomposition.  Moreover,  we 
explicitly characterize the effect of the sparsity pattern of the factors (i.e., the topic-word matrix)
on model identifiability, while all the previous works based on generic identifiability assume fully
dense factors (since sparse factors are not generic). For a general overview of tensor decompositions,
see [?].


Identifiability and learning of undercomplete/over-determined latent representations:
Most of the theoretical results on identifiability and learning of latent variable models are
limited to non-singular models, which implies that the latent space dimensionality is at most the
observed dimensionality. We outline some of the recent works below.

The works of Anandkumar et al. [6, 28, 29] provide an efficient moment-based approach for learning
topic models, under constraints on the distribution of the topic proportions, e.g., the single topic
model, and more generally latent Dirichlet allocation (LDA). In addition, the approach can handle
a variety of latent variable models such as Gaussian mixtures, hidden Markov models (HMM) and
community models [?]. The high-level idea is to reduce the problem of learning the latent
variable model to finding a CP decomposition of the (suitably adjusted) observed moment tensor.
Various approaches can then be employed to find the CP decomposition. In [6], a tensor power
method approach is analyzed and is shown to be an efficient guaranteed recovery method in the
non-degenerate (i.e., undercomplete) setting. Previously, simultaneous diagonalization techniques have
been employed for solving the CP decomposition, e.g., [28, 31, 32]. However, these techniques fail
when the model is overcomplete, as considered here. We note that some recent techniques, e.g., [20],
can  be  employed  instead,  albeit  at  a  cost  of  higher  computational  complexity  for  overcomplete  CP 
tensor  decomposition.  However,  it  is  not  clear  how  the  sparsity  constraints  affect  the  guarantees 
of  such  methods.  Moreover,  these  approaches  cannot  handle  general  topic  models,  where  the 
distribution  of  the  topic  proportions  is  not  limited  to  these  classes  (i.e.  either  single  topic  or 
Dirichlet  distribution),  and  we  require  tensor  decompositions  which  are  more  general  than  the  CP 
decomposition. 

There  are  many  other  works  which  consider  learning  mixture  models  when  multiple  views  are 
available. See [28] for a detailed description of these works. Recently, Rabani et al. [33] consider
learning discrete mixtures given a large number of “views”, and they refer to the number of views
as the sampling aperture. They establish improved recovery results (in terms of $\ell_1$ bounds) when
a sufficient number of views are available ($2k - 1$ views for a $k$-component mixture). However, their
results are limited to discrete mixtures or single-topic models, while our setting can handle more
general topic models. Moreover, our approach is different since we incorporate sparsity constraints
in the topic-word distribution. Another series of recent works by Arora et al. [?, 34] employ
approaches based on non-negative matrix factorization (NMF) to recover the topic-word matrix.
These  works  allow  models  with  arbitrarily  correlated  topics,  as  considered  here.  They  establish 
guaranteed  learning  when  every  topic  has  an  anchor  word,  i.e.  the  word  is  uniquely  generated 
from  that  topic,  and  does  not  occur  under  any  other  topic.  Note  that  the  anchor-word  assumption 
cannot  be  satisfied  in  the  overcomplete  setting. 

Our work is closely related to the work of Anandkumar et al. [7], which considers identifiability
and learning of topic models under expansion conditions on the topic-word matrix. The work
of Spielman et al. [35] considers the problem of dictionary learning, which is closely related to
the setting of [7], but in addition assumes that the coefficient matrix is random. However, these
works [7, 35] can handle only the under-complete setting, where the number of topics is less than the
dimensionality  of  the  word  vocabulary  (or  the  number  of  dictionary  atoms  is  less  than  the  number 
of  observations  in  [35]).  We  extend  these  results  to  the  overcomplete  setting  by  proposing  novel 
higher  order  expansion  conditions  on  the  topic-word  matrix,  and  also  incorporate  additional  rank 
constraints  present  in  higher  order  moments. 


Dictionary learning/sparse coding: Overcomplete representations have been very popular
in the context of dictionary learning or sparse coding. Here, the task is to jointly learn a dictionary
as well as a sparse selection of the dictionary atoms to fit the observed data. There have been
Bayesian as well as frequentist approaches for dictionary learning [2, 36, 37]. However, the heuristics
employed in these works [2, 36, 37] have no performance guarantees. The work of Spielman et
al. [35] considers learning (undercomplete) dictionaries and provides guaranteed learning under the
assumption that the coefficient matrix is random (distributed as Bernoulli-Gaussian variables).
Recent works [38, 39] provide generalization bounds for predictive sparse coding, where the goal
of the learned representation is to obtain good performance on some predictive task. This differs
from our framework since we do not consider predictive tasks here, but the task of recovering the
underlying latent representation. Hillar and Sommer [?] consider the problem of identifiability of
sparse  coding  and  establish  that  when  the  dictionary  succeeds  in  reconstructing  a  certain  set  of 
sparse  vectors,  then  there  exists  a  unique  sparse  coding,  up  to  permutation  and  scaling.  However, 
our  setting  here  is  different,  since  we  do  not  assume  that  a  sparse  set  of  topics  occur  in  each 
document. 


2  Model 


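Notation: The set $\{1, 2, \dots, n\}$ is denoted by $[n]$. Given a set $X = \{1, \dots, p\}$, the
set $X^{(n)}$ denotes all ordered $n$-tuples generated from $X$. The cardinality of a set $S$ is denoted by
$|S|$. For any vector $u$ (or matrix $U$), the support is denoted by $\mathrm{Supp}(u)$, and the $\ell_0$ norm is denoted
by $\|u\|_0$, which corresponds to the number of non-zero entries of $u$, i.e., $\|u\|_0 := |\mathrm{Supp}(u)|$. For a
vector $u \in \mathbb{R}^q$, $\mathrm{Diag}(u) \in \mathbb{R}^{q \times q}$ is the diagonal matrix with vector $u$ on its diagonal. The column
space of a matrix $A$ is denoted by $\mathrm{Col}(A)$. Vector $e_i \in \mathbb{R}^q$ is the $i$-th basis vector, with the $i$-th
entry equal to 1 and all the others equal to zero. For $A \in \mathbb{R}^{p \times q}$ and $B \in \mathbb{R}^{m \times n}$, the Kronecker
product $A \otimes B \in \mathbb{R}^{pm \times qn}$ is defined as [11]

$$A \otimes B := \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1q}B \\ a_{21}B & a_{22}B & \cdots & a_{2q}B \\ \vdots & \vdots & \ddots & \vdots \\ a_{p1}B & a_{p2}B & \cdots & a_{pq}B \end{bmatrix},$$

and for $A = [a_1 | a_2 | \cdots | a_r] \in \mathbb{R}^{p \times r}$ and $B = [b_1 | b_2 | \cdots | b_r] \in \mathbb{R}^{m \times r}$, the Khatri-Rao
product $A \odot B \in \mathbb{R}^{pm \times r}$ is defined as

$$A \odot B := [a_1 \otimes b_1 \,|\, a_2 \otimes b_2 \,|\, \cdots \,|\, a_r \otimes b_r].$$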
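The two products defined above differ mainly in how the column count grows. The following is a minimal NumPy sketch of that difference. The helper name `khatri_rao` and the example matrices are illustrative assumptions, not part of the paper.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker (Khatri-Rao) product of A (p x r) and B (m x r)."""
    p, r = A.shape
    m, r2 = B.shape
    if r != r2:
        raise ValueError("A and B must have the same number of columns")
    # Row index i*m + j of column k holds A[i, k] * B[j, k],
    # so column k equals kron(a_k, b_k).
    return np.einsum("ik,jk->ijk", A, B).reshape(p * m, r)

# Hypothetical small matrices, chosen only for illustration.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])          # p x q = 2 x 2
B = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [2.0, 5.0]])          # m x n = 3 x 2

K = np.kron(A, B)       # Kronecker product: shape (p*m, q*n)
KR = khatri_rao(A, B)   # Khatri-Rao product: shape (p*m, r)
```

The Kronecker product multiplies both the row and the column counts, whereas the Khatri-Rao product multiplies only the row count. This distinction is what the later sections rely on.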

2.1  Persistent  topic  model 

In this section, the n-persistent topic model is introduced. It imposes an additional constraint,
known as topic persistence, on the popular admixture model [4, 7, 42]. The n-persistent topic model
reduces to the bag-of-words admixture model when $n = 1$.

An admixture model specifies a $q$-dimensional vector of topic proportions $h \in \Delta^{q-1} := \{u \in \mathbb{R}^q :
u_i \ge 0, \sum_{i=1}^{q} u_i = 1\}$ which generates the observed variables $x_l \in \mathbb{R}^p$ through vectors $a_1, \dots, a_q \in
\mathbb{R}^p$. This collection of vectors $a_i$, $i \in [q]$, is referred to as the population structure or the topic-word
matrix [42]. For instance, $a_i$ is the conditional distribution of words given topic $i$. The latent
variable $h$ is a $q$-dimensional random vector $h := [h_1, \dots, h_q]^T$ known as the proportion vector. A prior
distribution $P(h)$ over the probability simplex $\Delta^{q-1}$ characterizes the prior joint distribution over
the latent variables $h_i$, $i \in [q]$. In topic modeling, this is the prior distribution over the $q$
topics.

The n-persistent topic model has a three-level multi-view hierarchy, shown in Figure 1, where $2rn$
words (views) are shown for some integer $r \ge 1$. In this model, a common hidden topic
is persistent for a sequence of $n$ words $\{x_{(j-1)n+1}, \dots, x_{(j-1)n+n}\}$, $j \in [2r]$. Note that the random
observed variables (words) are exchangeable within groups of size $n$, where $n$ is the persistence
level, but are not globally exchangeable.

We now describe a linear representation of the n-persistent topic model, along the lines of [6], but with
extensions to incorporate persistence. Each random variable $y_j$, $j \in [2r]$, is a discrete-valued random
variable taking one of $q$ possibilities $\{1, \dots, q\}$, i.e., $y_j \in [q]$ for $j \in [2r]$. In the n-persistent
model, a single common topic is chosen for a sequence of $n$ words $\{x_{(j-1)n+1}, \dots, x_{(j-1)n+n}\}$, $j \in
[2r]$, i.e., the topic is persistent for $n$ successive views. For notational purposes, we equivalently
assume that variables $y_j$, $j \in [2r]$, are encoded by the basis vectors $e_i$, $i \in [q]$. Thus, the variable
$y_j$, $j \in [2r]$, is

$$y_j = e_i \in \mathbb{R}^q \iff \text{the topic of the } j\text{-th group of words is } i.$$

Given proportion vector $h$, topics $y_j$, $j \in [2r]$, are independently drawn according to the conditional
expectation

$$E[y_j \mid h] = h, \quad j \in [2r],$$

or equivalently $\Pr[y_j = e_i \mid h] = h_i$, $j \in [2r]$, $i \in [q]$.

Finally, at the bottom layer, each observed variable $x_l$, for $l \in [2rn]$, is a discrete-valued
$p$-dimensional random variable, where $p$ is the size of the word vocabulary. Again, we assume that
variables $x_l$ are encoded by the basis vectors $e_k$, $k \in [p]$, such that

$$x_l = e_k \in \mathbb{R}^p \iff \text{the } l\text{-th word in the document is } k.$$

Given the corresponding topic $y_j$, $j \in [2r]$, words $x_l$, $l \in [2rn]$, are independently drawn according
to the conditional expectation

$$E[x_{(j-1)n+k} \mid y_j = e_i] = a_i, \quad i \in [q],\ j \in [2r],\ k \in [n], \qquad (1)$$

where vectors $a_i \in \mathbb{R}^p$, $i \in [q]$, are the conditional probability distribution vectors. The matrix $A =
[a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$ collecting these vectors is the population structure or topic-word matrix.
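The generative process described above (draw one topic per group of $n$ words, then draw the $n$ words from that topic's column of $A$) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `sample_document`, the use of NumPy's random generator, and the toy parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def sample_document(A, h, n, r, rng):
    """Draw 2*r*n word indices from an n-persistent topic model.

    A   : (p, q) topic-word matrix; column i is the word distribution a_i.
    h   : (q,) topic proportion vector on the simplex.
    n   : persistence level (words per group sharing one topic).
    r   : the document has 2*r groups, i.e. 2*r*n words.
    rng : a numpy.random.Generator.
    """
    p, q = A.shape
    words = []
    for _ in range(2 * r):                      # groups j = 1, ..., 2r
        topic = rng.choice(q, p=h)              # Pr[y_j = e_i | h] = h_i
        # The same topic generates n successive words (persistence).
        group = rng.choice(p, size=n, p=A[:, topic])
        words.extend(int(w) for w in group)
    return words

# Toy example: with A = I each topic emits exactly one word,
# so the n words inside a group must coincide.
rng = np.random.default_rng(0)
A = np.eye(3)
h = np.array([0.2, 0.3, 0.5])
doc = sample_document(A, h, n=2, r=1, rng=rng)
```

Setting `n=1` in this sketch gives the usual bag-of-words admixture model, in which every word gets its own independently drawn topic.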

The $(2rn)$-th order moment of observed variables $x_l$, $l \in [2rn]$, for some integer $r \ge 1$, is defined as
(in matrix form)⁶

$$M_{2rn}(x) := E\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^T\right] \in \mathbb{R}^{p^{rn} \times p^{rn}}. \qquad (2)$$

6 Vector $x$ is the vector generated by concatenating all vectors $x_l$, $l \in [2rn]$.

For the n-persistent topic model with $2rn$ observations (words) $x_l$, $l \in [2rn]$, the
corresponding moment is denoted by $M^{(n)}_{2rn}(x)$. Note that to estimate the $(2rn)$-th moment, we require
a minimum of $2rn$ words in each document. We can select the first $2rn$ words in each document,
and average over the different documents to obtain a consistent estimate of the moment. In this
paper, we consider the problem of identifiability when exact moments are available.
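The averaging procedure just described can be sketched as below. Since each word is a basis vector, the Kronecker product of the first $rn$ words is itself a basis vector of length $p^{rn}$, so the estimate reduces to counting. The function name `empirical_moment`, the parameter `half` (standing for $rn$), and the toy corpus are assumptions made for illustration.

```python
import numpy as np

def empirical_moment(docs, p, half):
    """Estimate E[(x_1 kron ... kron x_half)(x_{half+1} kron ... kron x_{2*half})^T].

    docs : list of documents, each a sequence of word indices in {0, ..., p-1}.
    p    : vocabulary size.
    half : number of words in each Kronecker factor (rn in the text).
    """
    dims = (p,) * half
    M = np.zeros((p ** half, p ** half))
    for doc in docs:
        if len(doc) < 2 * half:
            raise ValueError("each document needs at least 2*half words")
        # kron of basis vectors e_{k1}, ..., e_{k_half} is the basis vector
        # whose index is (k1, ..., k_half) read in row-major order.
        row = np.ravel_multi_index(tuple(doc[:half]), dims)
        col = np.ravel_multi_index(tuple(doc[half:2 * half]), dims)
        M[row, col] += 1.0
    return M / len(docs)

# Hypothetical corpus of three documents over a vocabulary of size 3.
docs = [[0, 1, 2, 0], [0, 1, 2, 0], [1, 1, 0, 0]]
M_hat = empirical_moment(docs, p=3, half=2)
```

Only the first `2*half` words of each document are used, matching the selection rule in the text. The identifiability analysis itself assumes exact rather than estimated moments.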

The moment characterization of the n-persistent topic model is provided in Lemma 1 in Section
4.1. Given $M^{(n)}_{2rn}(x)$, what are the sufficient conditions under which the population structure $A$ is
identifiable? This is answered in Section 3.

Remark 1. Note that our results are valid for the more general linear model $x_l = A y_j$ (more
precisely, $x_{(j-1)n+k} = A y_j$, $j \in [2r]$, $k \in [n]$), i.e., each column of matrix $A$ does not need to be
a valid probability distribution. Furthermore, the observed random variables $x_l$ can be continuous
while the hidden ones $y_j$ are assumed to be discrete.


3  Sufficient  Conditions  for  Generic  Identifiability 

In this section, the identifiability result for the n-persistent topic model with access to the $(2n)$-th order
observed moment is provided. First, sufficient deterministic conditions on the population structure
$A$ are provided for identifiability in Theorem 1. Next, the deterministic analysis is specialized to a
random structured model in Theorem 2.

We now make the notion of identifiability precise. As defined in the literature, (strict) identifiability
means that the population structure $A$ can be uniquely recovered up to permutation and scaling
for all $A \in \mathbb{R}^{p \times q}$. Instead, we consider a more relaxed notion of identifiability, known as generic
identifiability.

Definition 1 (Generic identifiability). We refer to a matrix $A \in \mathbb{R}^{p \times q}$ as generic, with a fixed
sparsity pattern, when the nonzero entries of $A$ are drawn from a distribution which is absolutely
continuous with respect to Lebesgue measure.⁷ For a given sparsity pattern, the class of population
structure matrices is said to be generically identifiable [25], if all the non-identifiable matrices form
a set of Lebesgue measure zero.

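The $(2r)$-th order moment of hidden variables $h \in \mathbb{R}^q$, denoted by $M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$, is defined
as

$$M_{2r}(h) := E\Big[\big(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}\big)\big(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}}\big)^T\Big] \in \mathbb{R}^{q^r \times q^r}. \qquad (3)$$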


We  now  provide  a  set  of  sufficient  conditions  for  generic  identifiability  of  structured  topic  models 
given  (2rn)-th  order  observed  moment.  We  first  start  with  a  natural  assumption  on  the  hidden 
variables. 

Condition 1 (Non-degeneracy). The $(2r)$-th order moment of hidden variables $h \in \mathbb{R}^q$, defined in
equation (3), is full rank (non-degeneracy of hidden nodes).

Note  that  there  is  no  hope  of  distinguishing  distinct  hidden  nodes  without  this  non-degeneracy 
assumption.  We  do  not  impose  any  other  assumption  on  hidden  variables  and  can  incorporate 
arbitrarily  correlated  topics. 
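To make the non-degeneracy condition concrete for $r = 1$, the sketch below computes $M_2(h) = E[h h^T]$ exactly for a proportion vector $h$ that takes finitely many values, and checks whether it is full rank. The finite-support prior, the function names, and the numbers are illustrative assumptions. The paper allows any non-degenerate distribution on the simplex.

```python
import numpy as np

def second_moment(atoms, probs):
    """M_2(h) = E[h h^T] when h equals atoms[k] with probability probs[k]."""
    atoms = np.asarray(atoms, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return np.einsum("k,ki,kj->ij", probs, atoms, atoms)

def is_non_degenerate(M):
    """True when the moment matrix is full rank."""
    return bool(np.linalg.matrix_rank(M) == M.shape[0])

# Correlated topics: h is one of three linearly independent points on the simplex.
mixed = second_moment(
    [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.3, 0.1, 0.6]],
    [0.5, 0.3, 0.2],
)

# Degenerate case: h is deterministic, so E[h h^T] is a rank-one outer product.
fixed = second_moment([[0.2, 0.3, 0.5]], [1.0])
```

In the deterministic case the moment matrix has rank one, so the individual topics cannot be told apart from this moment. That is the situation the condition rules out.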

7 As an equivalent definition, if the non-zero entries of an arbitrary sparse matrix are independently perturbed
with noise drawn from a continuous distribution to generate $A$, then $A$ is called generic.

Figure 2: A bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = 6$ where the edge set $E$ itself is a perfect
2-gram matching.


Furthermore,  we  can  only  hope  to  identify  the  population  structure  A  up  to  scaling  and  permutation. 
Therefore,  we  can  identify  A  up  to  a  canonical  form  defined  as: 

Definition  2  (Canonical  form).  Population  structure  A  is  said  to  be  in  canonical  form  if  all  of  its 
columns  have  unit  norm. 


3.1  Deterministic  conditions  for  generic  identifiability 

In  this  section,  we  consider  a  fixed  sparsity  pattern  on  the  population  structure  A  and  establish 
generic  identifiability  when  non-zero  entries  of  A  are  drawn  from  some  continuous  distribution. 
Before  providing  the  main  result,  a  generalized  notion  of  (perfect)  matching  for  bipartite  graphs 
is  defined.  We  subsequently  impose  these  conditions  on  the  bipartite  graph  from  topics  to  words 
which  encodes  the  sparsity  pattern  of  population  structure  A. 


Generalized  matching  for  bipartite  graphs 

A bipartite graph with two disjoint vertex sets $Y$ and $X$ and an edge set $E$ between them is
denoted by $G(Y, X; E)$. Given the bi-adjacency matrix $A$, the notation $G(Y, X; A)$ is also used
to denote a bipartite graph. Here, the rows and columns of matrix $A \in \mathbb{R}^{|X| \times |Y|}$ are respectively
indexed by the $X$ and $Y$ vertex sets. For any subset $S \subseteq Y$, the set of neighbors of vertices in $S$
with respect to $A$ is defined as $N_A(S) := \{i \in X : A_{ij} \neq 0 \text{ for some } j \in S\}$, or equivalently,
$N_E(S) := \{i \in X : (j, i) \in E \text{ for some } j \in S\}$ with respect to edge set $E$.

Here,  we  define  a  generalized  notion  of  matching  for  a  bipartite  graph  and  refer  to  it  as  n-gram 
matching. 

Definition 3 ((Perfect) n-gram matching). An n-gram matching $M$ for a bipartite graph $G(Y, X; E)$
is a subset of edges $M \subseteq E$ which satisfies the following conditions. First, for any $j \in Y$, we
have $|N_M(j)| \le n$. Second, for any $j_1, j_2 \in Y$, $j_1 \neq j_2$, we have
$\min\{|N_M(j_1)|, |N_M(j_2)|\} > |N_M(j_1) \cap N_M(j_2)|$.

A perfect n-gram matching or Y-saturating n-gram matching for the bipartite graph $G(Y, X; E)$ is
an n-gram matching $M$ in which each vertex in $Y$ is the end-point of exactly $n$ edges in $M$.

In words, in an n-gram matching $M$, each vertex $j \in Y$ is the end-point of at most $n$ edges in $M$ and
for any pair of vertices in $Y$ ($j_1, j_2 \in Y$, $j_1 \neq j_2$), there exists at least one non-common neighbor in
set $X$ for each of them ($j_1$ and $j_2$).

As an example, a bipartite graph $G(Y, X; E)$ with $|X| = 4$ and $|Y| = 6$ is shown in Figure 2, for
which the edge set $E$ itself is a perfect 2-gram matching.
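The definition can be checked mechanically. The sketch below transcribes the two conditions literally and then rebuilds an example of the kind described for Figure 2, with 4 words and 6 topics, each topic matched to a distinct pair of words. The dictionary-of-sets representation and the function names are assumptions made for illustration.

```python
from itertools import combinations

def is_ngram_matching(M, E, n):
    """Check Definition 3. M and E map each vertex j in Y to a set of vertices in X."""
    if any(not M[j] <= E[j] for j in M):          # M must be a subset of E
        return False
    if any(len(M[j]) > n for j in M):             # at most n matched edges per j
        return False
    # Every pair must keep at least one non-common neighbor on each side.
    return all(
        min(len(M[a]), len(M[b])) > len(M[a] & M[b])
        for a, b in combinations(M, 2)
    )

def is_perfect_ngram_matching(M, E, n):
    """Perfect (Y-saturating): every vertex of Y has exactly n matched edges."""
    return (
        set(M) == set(E)
        and is_ngram_matching(M, E, n)
        and all(len(M[j]) == n for j in M)
    )

# |X| = 4 words, |Y| = 6 topics: topic j is connected to the j-th pair of words.
E = {j: set(pair) for j, pair in enumerate(combinations(range(4), 2))}
edge_set_is_perfect = is_perfect_ngram_matching(E, E, n=2)
```

When every vertex has exactly $n$ matched edges, the second condition is equivalent to requiring that no two vertices of $Y$ are matched to the same $n$-subset of $X$.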
Remark 2 (Relationship to other matchings). The relationship of n-gram matching to other types
of matchings is discussed below.

•  Regular matching: For the special case $n = 1$, the (perfect) n-gram matching reduces to the usual
(perfect) matching for bipartite graphs.

•  b-matching: A b-matching for a bipartite graph $G(Y, X; E)$ (with equal vertex sizes $|X| = |Y|$)
is a subset of edges $M_b \subseteq E$, where each vertex is connected to $b$ edges. Compared with the
proposed perfect n-gram matching, b-matching does not enforce that the sets of neighbors be
different, and furthermore, it requires that $|X| = |Y|$, which is not possible under the overcomplete
setting.

Remark 3 (Necessary size bound). Consider a bipartite graph $G(Y, X; E)$ with $|Y| = q$ and $|X| = p$
which has a perfect n-gram matching. Note that there are $\binom{p}{n}$ n-combinations on the $X$ side and each
combination can have at most one neighbor (a node in $Y$ which is connected to all nodes in the
combination) through the matching, and therefore we necessarily have $q \le \binom{p}{n}$.

Finally, note that the existence of a perfect n-gram matching implies the existence of a perfect
(n + 1)-gram matching,⁸ but the reverse is not true. For example, the bipartite graph $G(Y, X; E)$ with
$|X| = 4$ and $|Y| = \binom{4}{2} = 6$ in Figure 2 has a perfect 2-gram matching, but not a perfect (1-gram)
matching (since $6 > 4$).


Identifiability  conditions  based  on  existence  of  perfect  n-gram  matching  in  topic-word 
graph 

Now,  we  are  ready  to  propose  the  identifiability  conditions  and  result. 

Condition 2 (Perfect n-gram matching on A). The bipartite graph $G(V_h, V_o; A)$ between hidden
and observed variables has a perfect n-gram matching.

The above condition implies that the sparsity pattern of matrix $A$ is appropriately scattered in
the mapping from hidden to observed variables to be identifiable. Intuitively, it means that every
hidden node can be distinguished from another hidden node by its unique set of neighbors under
the corresponding n-gram matching.

Furthermore, Condition 2 is the key to establishing identifiability in the overcomplete regime.
As stated in the size bound in Remark 3, for $n \ge 2$, the number of hidden variables can be more
than the number of observed variables and we can still have a perfect n-gram matching.

Definition  4  (Kruskal  rank,  [15]).  The  Kruskal  rank  or  the  krank  of  matrix  A  is  defined  as  the 
maximum  number  k  such  that  every  subset  of  k  columns  of  A  is  linearly  independent. 

Note that krank is different from the general notion of matrix rank and it is a lower bound for the
matrix rank, i.e., $\mathrm{Rank}(A) \ge \mathrm{krank}(A)$.

Condition 3 (Krank condition on A). The Kruskal rank of matrix $A$ satisfies the bound $\mathrm{krank}(A) \ge
d_{\max}(A)^n$, where $d_{\max}(A)$ is the maximum node degree of any column of $A$.

In the overcomplete regime, it is not possible for $A$ to be full column rank and $\mathrm{krank}(A) < |V_h| = q$.
However, note that a large enough krank ensures that appropriate sized subsets of columns of $A$
are linearly independent. For instance, when $\mathrm{krank}(A) > 1$, any two columns cannot be collinear
and the above condition rules out the collinear case for identifiability. In the above condition, we
see that a larger krank can incorporate denser connections between topics and words.

8 Note that the degree of each node (on matching side $Y$) in the original bipartite graph should be at least $n + 1$.
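For small matrices, the Kruskal rank and the krank condition can be checked by brute force, as in the sketch below. The function names are hypothetical, the search over column subsets is exponential in $q$, and numerical rank is decided with NumPy's default tolerance, so this is an illustration rather than a practical test.

```python
from itertools import combinations
import numpy as np

def krank(A):
    """Largest k such that every set of k columns of A is linearly independent."""
    q = A.shape[1]
    for k in range(1, q + 1):
        for cols in combinations(range(q), k):
            if np.linalg.matrix_rank(A[:, list(cols)]) < k:
                return k - 1
    return q

def max_degree(A):
    """d_max(A): the largest number of non-zero entries in any column."""
    return int(np.count_nonzero(A, axis=0).max())

def satisfies_krank_condition(A, n):
    """Condition 3: krank(A) >= d_max(A)^n."""
    return krank(A) >= max_degree(A) ** n

# Three columns in R^2: any two are independent, all three are not.
A1 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 1.0]])
# Two collinear columns.
A2 = np.array([[1.0, 2.0],
               [2.0, 4.0]])
```

Because the bound involves $d_{\max}(A)^n$, the condition favors sparse columns: a topic supported on many words requires a correspondingly large Kruskal rank.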


The main identifiability result under a fixed graph structure is stated in the following theorem for
$n \ge 2$, where $n$ is the topic persistence level. The identifiability result relies on having access to
the $(2rn)$-th order moment of observed variables $x_l$, $l \in [2rn]$, defined in equation (2) as

$$M_{2rn}(x) := E\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^T\right] \in \mathbb{R}^{p^{rn} \times p^{rn}},$$

for some integer $r \ge 1$.

Theorem 1 (Generic identifiability under deterministic topic-word graph structure). Let $M^{(n)}_{2rn}(x)$
in equation (2) be the $(2rn)$-th order observed moment of the n-persistent topic model for some
integer $r \ge 1$. If the model satisfies Conditions 1, 2 and 3, then, for any $n \ge 2$, all the columns of
population structure $A$ are generically identifiable from $M^{(n)}_{2rn}(x)$. Furthermore, the $(2r)$-th order
moment of the hidden variables, denoted by $M_{2r}(h)$, is also generically identifiable.


The theorem is proved in Appendix A. It is seen that the population structure $A$ is identifiable,
given any observed moment of order at least $2n$. Increasing the order of the observed moment results
in identifying higher order moments of the hidden variables.

The above theorem does not cover the case when the persistence level $n = 1$. This is the usual
bag-of-words admixture model. Identifiability of this model has been studied earlier [7] and we
recall it below.

Remark 4 (Bag-of-words admixture model, [7]). Given $(2r)$-th order observed moments with $r \ge 1$,
the structure of the popular bag-of-words admixture model and the $(2r)$-th order moment of hidden
variables are identifiable, when $A$ is full column rank and the following expansion condition holds:

$$|N_A(S)| \ge |S| + d_{\max}(A), \quad \forall S \subseteq V_h,\ |S| \ge 2. \qquad (4)$$

Our result for $n \ge 2$ in Theorem 1 provides identifiability in the overcomplete regime with the weaker
matching Condition 2 and krank Condition 3. The matching Condition 2 is weaker than the above
expansion condition, which is based on the perfect matching and hence does not allow overcomplete
models. Furthermore, the above result for the bag-of-words admixture model requires full column
rank of $A$, which is more stringent than our krank Condition 3.

Remark 5 (Kruskal rank and degree diversity). Condition 3 requires that the Kruskal rank of the
topic-word matrix be large enough compared to the maximum degree of the topics. Intuitively, a
larger Kruskal rank ensures enough diversity in the word supports among different topics under a
higher level of sparsity. This Kruskal rank condition also allows for more degree diversity among
the topics, when the topic persistence level $n > 1$. On the other hand, for the bag-of-words model
($n = 1$), using (4) implies that $2 d_{\min} > d_{\max}$, where $d_{\min}, d_{\max}$ are the minimum and maximum
degrees of the topics. Thus, we provide identifiability results with more degree diversity when higher
order moments are employed.

Remark 6 (Recovery using $\ell_1$ optimization). It turns out that our conditions for identifiability
imply that the columns of the n-gram matrix $A^{\odot n}$, defined in Definition 6, are the sparsest vectors in
$\mathrm{Col}\big(M^{(n)}_{2n}(x)\big)$, having a tensor rank of one. See Appendix A. This implies recovery of the columns
of $A$ through exhaustive search, which is not efficient. Efficient $\ell_1$-based recovery algorithms have
been analyzed in [7, 35] for the undercomplete case ($n = 1$). They can be employed here for recovery
from higher order moments as well. Exploiting additional structure present in $A^{\odot n}$, for $n > 1$, such
as the rank-1 test devices proposed in prior work, is an interesting avenue for future investigation.
3.2  Analysis under random topic-word graph structures


In this section, we specialize the identifiability result to the random case. This result is based on
more transparent conditions on the size and the degree of the random bipartite graph $G(V_h, V_o; A)$.
We consider the random model where, in the bipartite graph $G(V_h, V_o; A)$, each node $i \in V_h$ is
randomly connected to $d_i$ different nodes in set $V_o$. Note that this is a heterogeneous degree
model.

Condition 4 (Size condition). The random bipartite graph $G(V_h, V_o; A)$ with $|V_h| = q$, $|V_o| = p$,
and $A \in \mathbb{R}^{p \times q}$, satisfies the size condition $q \le \left(c\,\frac{p}{n}\right)^n$ for some constant $0 < c < 1$.

This size condition is required to establish that the random bipartite graph has a perfect n-gram
matching (and hence satisfies deterministic Condition 2). It is shown in Section 5.2.1 that the
necessary size constraint $q = O(p^n)$ stated in Remark 3 is achieved in the random case. Thus, the
above constraint allows for the overcomplete regime, where $q \gg p$ for $n \ge 2$, and is tight.

Condition 5 (Degree condition). In the random bipartite graph $G(V_h, V_o; A)$ with $|V_h| = q$, $|V_o| =
p$, and $A \in \mathbb{R}^{p \times q}$, the degree $d_i$ of nodes $i \in V_h$ satisfies the following lower and upper bounds
($d_i \in [d_{\min}, d_{\max}]$):

•  Lower bound: $d_{\min} \ge \max\{1 + \beta \log p,\ \alpha \log p\}$ for some constants $\beta > \frac{n-1}{\log(1/c)}$, $\alpha > \max\{2n^2(\beta \log\frac{1}{c} +
1),\ 2\beta n\}$.

•  Upper bound: $d_{\max} \le (cp)^{1/n}$.

Intuitively, the lower bound on the degree is required to show that the corresponding bipartite
graph $G(V_h, V_o; A)$ has a sufficient number of random edges to ensure that it has a perfect n-gram
matching with high probability. The upper bound on the degree is mainly required to satisfy the
krank Condition 3, where $d_{\max}(A)^n \le \mathrm{krank}(A)$.

It is important to see that, for $n \ge 2$, the above condition on degree covers a range of models from
sparse to intermediate regimes and it is reasonable in a number of applications that each topic does
not generate a very large number of words.
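The random model and the matching condition can be explored numerically with the sketch below. It draws a random topic-word graph in which topic $i$ picks $d_i$ distinct words, and then searches for a perfect n-gram matching by assigning each topic a distinct $n$-subset of its neighbors with a standard augmenting-path bipartite matching. The function names and parameters are assumptions made for illustration, and the search is only practical for small $p$, $n$ and degrees.

```python
from itertools import combinations
import numpy as np

def random_topic_word_graph(p, degrees, rng):
    """Topic i is connected to degrees[i] distinct words chosen uniformly from p."""
    return [set(rng.choice(p, size=d, replace=False).tolist()) for d in degrees]

def find_perfect_ngram_matching(neighbors, n):
    """Return one n-subset of neighbors per topic, all distinct, or None."""
    owner = {}  # n-subset of words (sorted tuple) -> topic currently using it

    def try_assign(j, seen):
        for subset in combinations(sorted(neighbors[j]), n):
            if subset in seen:
                continue
            seen.add(subset)
            # Take a free subset, or evict its owner if the owner can move.
            if subset not in owner or try_assign(owner[subset], seen):
                owner[subset] = j
                return True
        return False

    for j in range(len(neighbors)):
        if not try_assign(j, set()):
            return None
    assigned = {topic: subset for subset, topic in owner.items()}
    return [assigned[j] for j in range(len(neighbors))]

# Overcomplete toy instance (hypothetical sizes): q = 60 topics, p = 30 words.
rng = np.random.default_rng(0)
graph = random_topic_word_graph(p=30, degrees=[5] * 60, rng=rng)
matching = find_perfect_ngram_matching(graph, n=2)  # None if no matching exists
```

On a complete graph this search succeeds exactly when $q \le \binom{p}{n}$, which matches the necessary size bound of Remark 3.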

Definition 5 (whp). A sequence of events $\mathcal{E}_p$ occurs with high probability (whp) if $\Pr(\mathcal{E}_p) =
1 - O(p^{-\epsilon})$ for some $\epsilon > 0$.

The main random identifiability result is stated in the following theorem for $n \ge 2$, while the $n = 1$
case is addressed in Remark 8. The identifiability result relies on having access to the $(2rn)$-th
order moment of observed variables $x_l$, $l \in [2rn]$, defined in equation (2) as

$$M_{2rn}(x) := E\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_{rn})(x_{rn+1} \otimes x_{rn+2} \otimes \cdots \otimes x_{2rn})^T\right] \in \mathbb{R}^{p^{rn} \times p^{rn}},$$

for some integer $r \ge 1$.

Probability rate constants: The probability rate of success in the following random identifiability
result is specified by constants $\beta' > 0$ and $\gamma = \gamma_1 + \gamma_2 > 0$ as

$$\beta' = -\beta \log c - n + 1, \qquad (5)$$

$$\gamma_1 = [\ldots], \qquad (6)$$

$$\gamma_2 = \frac{c^{n-1} e^2}{n^n (1 - \delta_2)}, \qquad (7)$$

where $\delta_1$ and $\delta_2$ are some constants satisfying $[\ldots] < \delta_1 < 1$ and $[\ldots] < \delta_2 < 1$.

Theorem 2 (Random identifiability). Let $M^{(n)}_{2rn}(x)$ in equation (2) be the $(2rn)$-th order observed
moment of the n-persistent topic model for some integer $r \ge 1$. If the model with random population
structure $A$ satisfies Conditions 1, 4 and 5, then whp (with probability at least $1 - \gamma p^{-\beta'}$ for constants
$\beta' > 0$ and $\gamma > 0$, specified in (5)-(7)), for any $n \ge 2$, all the columns of population structure $A$
are identifiable from $M^{(n)}_{2rn}(x)$. Furthermore, the $(2r)$-th order moment of hidden variables, denoted
by $M_{2r}(h)$, is also identifiable, whp.

The theorem is proved in Appendix B. Similar to the deterministic analysis, it is seen that the
population structure $A$ is identifiable given any observed moment of order at least $2n$. Increasing
the order of the observed moment results in identifying higher order moments of the hidden variables.

Remark 7 (Trade-off between topic-word size ratio and degree). When the number of hidden
variables increases, i.e. $c$ increases, but the order $n$ is kept fixed, the bounds on degree in Condition 5
also need to grow. Intuitively, a larger degree is needed to provide more flexibility in choosing the
subsets of neighbors for hidden nodes to ensure the existence of a perfect n-gram matching in
the bipartite graph, which in turn ensures identifiability. Note that as $c$ grows, the parameter $\beta$,
which governs the lower bound on $d$, also grows, and the probability rate (i.e., the term $-\beta \log c$) remains
constant. Hence, the probability rate does not change as $c$ increases, since the increase in the degree
$d$ compensates for the additional “difficulty” arising due to a larger number of hidden variables.

The above identifiability theorem only covers $n \ge 2$, and the $n = 1$ case is addressed in the
following remark.

Remark 8 (Bag-of-words admixture model). The identifiability result for the random bag-of-words
admixture model is comparable to the result in [35], which considers exact recovery of sparsely-used
dictionaries. They assume that $Y = DX$ is given for some unknown arbitrary dictionary $D \in \mathbb{R}^{q \times q}$
and unknown random sparse coefficient matrix $X \in \mathbb{R}^{q \times p}$. They establish that if $D \in \mathbb{R}^{q \times q}$ is full
rank and the random sparse coefficient matrix $X \in \mathbb{R}^{q \times p}$ follows the Bernoulli-subgaussian model
with size constraint $p > C q \log q$ and degree constraint $O(\log q) \le E[d] \le O(q \log q)$, then the model
is identifiable, whp. Comparing the size and degree constraints, our identifiability result for $n \ge 2$
requires a more stringent upper bound on the degree ($d = O(p^{1/n})$), but a more relaxed condition on
the size ($q = O(p^n)$), which allows identifiability in the overcomplete regime.

Remark 9 (The size condition is tight). The size bound $q = O(p^n)$ in the above theorem achieves
the necessary condition that $q \le \binom{p}{n} = O(p^n)$ (see Remark 3), and is therefore tight. The sufficiency
is argued in the theorem showing that the matching Condition 2 holds under the above size
and degree Conditions 4 and 5.


4  Identifiability  via  Uniqueness  of  Tensor  Decompositions 


In this section, we characterize the moments of the n-persistent topic model in terms of the model
parameters, i.e. the topic-word matrix $A$ and the moment of hidden variables. We relate identifiability
of the topic model to uniqueness of a certain class of tensor decompositions, which in turn
enables us to prove Theorems 1 and 2. We then discuss the special cases of the persistent topic
model, viz., the single topic model (infinite-persistent topic model) and the bag-of-words admixture
model (1-persistent topic model).

4.1  Moment  characterization  of  the  persistent  topic  model 

The  moment  characterization  requires  the  following  definition  of  a  n-gram  matrix. 

Definition 6 (n-gram Matrix). Given a matrix $A \in \mathbb{R}^{p \times q}$, its n-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ is
defined as the matrix whose $(i, j)$-th entry is given by, for $i := (i_1, i_2, \dots, i_n) \in [p]^n$ and $j \in [q]$,

$$A^{\odot n}(i, j) := A_{i_1, j} A_{i_2, j} \cdots A_{i_n, j}, \quad \text{or} \quad A^{\odot n} := \underbrace{A \odot \cdots \odot A}_{n \text{ times}}.$$

That is, $A^{\odot n}$ is the column-wise $n$-th order Kronecker product of $n$ copies of $A$, and is known as the
Khatri-Rao product.
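The n-gram matrix can be built by repeated column-wise Kronecker products, as in the following sketch. The function name `ngram_matrix` and the test matrix are illustrative assumptions. Rows are ordered so that the tuple $(i_1, \dots, i_n)$ is read in row-major order, with zero-based indices in the code.

```python
import numpy as np

def ngram_matrix(A, n):
    """n-gram matrix: the Khatri-Rao product of n copies of A (shape p^n x q)."""
    q = A.shape[1]
    out = A
    for _ in range(n - 1):
        # Append one more factor: new row index = old_row * p + i.
        out = np.einsum("ik,jk->ijk", out, A).reshape(-1, q)
    return out

# Hypothetical 3 x 2 matrix, used only to check the entry-wise formula.
rng = np.random.default_rng(0)
A = rng.random((3, 2))
A3 = ngram_matrix(A, 3)                      # shape (27, 2)

# Entry for the word triple (i1, i2, i3) = (2, 0, 1) and topic j = 1.
row = np.ravel_multi_index((2, 0, 1), (3, 3, 3))
entry = A3[row, 1]
product = A[2, 1] * A[0, 1] * A[1, 1]        # A_{i1,j} A_{i2,j} A_{i3,j}
```

The number of columns stays at $q$ for every $n$, while the number of rows grows as $p^n$.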

In the following lemma, which is proved in Appendix A.2, we characterize the observed moments
of a persistent topic model. Throughout this section, the order of the observed moment is fixed to
$2m$.

Lemma 1 (n-persistent topic model moment characterization). The $(2m)$-th order moment of
observed variables, defined in equation (2), for the n-persistent topic model is characterized as:⁹

•  If $m = rn$, for some integer $r \ge 1$, then

$$M^{(n)}_{2m}(x) = \big(\underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r \text{ times}}\big)\, M_{2r}(h)\, \big(\underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r \text{ times}}\big)^T, \qquad (8)$$

where $M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$ is the $(2r)$-th order moment of hidden variables $h \in \mathbb{R}^q$, defined in
equation (3).

•  If $n \ge 2m$, then

$$M^{(n)}_{2m}(x) = \big(A^{\odot m}\big)\, M_1(h)\, \big(A^{\odot m}\big)^T, \qquad (9)$$

where $M_1(h) := \mathrm{Diag}(E[h]) \in \mathbb{R}^{q \times q}$ is the first order moment of hidden variables $h \in \mathbb{R}^q$,
stacked in a diagonal matrix.

Thus, we see that the observed moments can be expressed in terms of the hidden moments $M(h)$
and the Kronecker products of the n-gram matrices. In the special case when the persistence level
is large enough compared to the order of the moment ($n \ge 2m$), the moment form reduces to a
Khatri-Rao product form in (9). Moreover, in (9), we have a diagonal matrix $M_1(h)$ instead of
the general (dense) matrix $M_{2r}(h)$ that appears in (8) when $n < 2m = 2rn$. Thus, we have a more succinct
representation of the moments in (9) when the persistence level of the topics is large enough.
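As a sanity check of the form in (8), the sketch below takes the smallest persistent case ($n = 2$, $r = 1$, so a fourth-order moment) and compares the formula with a brute-force sum over the joint distribution of four words. The finite-support prior on $h$, the sizes $p = 3$, $q = 2$, and all names are assumptions chosen for illustration.

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product of two matrices with equal column counts."""
    return np.einsum("ik,jk->ijk", A, B).reshape(A.shape[0] * B.shape[0], A.shape[1])

rng = np.random.default_rng(0)
p, q = 3, 2
A = rng.random((p, q))
A /= A.sum(axis=0)                       # columns are word distributions

atoms = np.array([[0.7, 0.3],            # possible values of h (hypothetical prior)
                  [0.2, 0.8]])
probs = np.array([0.4, 0.6])             # probability of each value

# Brute force for n = 2, r = 1: words (x1, x2) share topic i, (x3, x4) share topic j.
# Pr[x1=a, x2=b, x3=c, x4=d] = sum_h P(h) sum_{i,j} h_i h_j A[a,i] A[b,i] A[c,j] A[d,j].
joint = np.einsum("k,ki,kj,ai,bi,cj,dj->abcd", probs, atoms, atoms, A, A, A, A)
# With basis-vector words, E[(x1 kron x2)(x3 kron x4)^T] is the joint table reshaped.
M_brute = joint.reshape(p * p, p * p)

# Formula (8) with r = 1: (A ⊙ A) M_2(h) (A ⊙ A)^T.
M2_h = np.einsum("k,ki,kj->ij", probs, atoms, atoms)
A2 = khatri_rao(A, A)
M_formula = A2 @ M2_h @ A2.T
```

The two matrices agree up to floating-point error. The topic-word matrix enters only through the n-gram matrix $A \odot A$, which has $q$ columns regardless of $p$.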

In the following, we contrast the special cases when the persistence level $n$ is $n \to \infty$ (single topic
model) and $n = 1$ (bag-of-words admixture model), as shown in Fig. 3a and Fig. 3b. In order to
have a fair comparison, the number of observed variables is fixed to $2m$ and the persistence level is
varied.

9 The other cases not covered in Lemma 1 are deferred to the appendix.

Figure 3: Hierarchical structure of the single topic model and bag-of-words admixture model shown for $2m$
words (views). (a) Single topic model (infinite-persistent topic model). (b) Bag-of-words admixture model
(1-persistent topic model).

Single topic model ($n \to \infty$): The condition in (9) ($n \ge 2m$) is always satisfied for the
single topic model, since $n \to \infty$ in this case, and we have

$$M^{(\infty)}_{2m}(x) = \big(A^{\odot m}\big)\, M_1(h)\, \big(A^{\odot m}\big)^T. \qquad (10)$$

Note that $M_1(h)$ is a diagonal matrix.

Bag-of-words admixture model ($n = 1$): From Lemma 1, the $(2m)$-th order moment of
observed variables $x_l$, $l \in [2m]$, for the bag-of-words admixture model (1-persistent topic model),
shown in Figure 3b, is given by

$$M^{(1)}_{2m}(x) = \big(\underbrace{A \otimes \cdots \otimes A}_{m \text{ times}}\big)\, M_{2m}(h)\, \big(\underbrace{A \otimes \cdots \otimes A}_{m \text{ times}}\big)^T, \qquad (11)$$

where $M_{2m}(h) \in \mathbb{R}^{q^m \times q^m}$ is the $(2m)$-th order moment of hidden variables $h \in \mathbb{R}^q$, defined in (3).
Note that $M_{2m}(h)$ is a full matrix in general.

Contrasting single topic ($n \to \infty$) and bag-of-words models ($n = 1$): Comparing equations
(10) and (11), it is seen that the moments under the single topic model in (10) are more “structured”
compared to the bag-of-words model in (11). In (11), we have Kronecker products of the topic-word
matrix $A$, while (10) involves Khatri-Rao products of $A$. This forms a crucial criterion in
determining whether overcomplete models are identifiable, as discussed below.

Why does persistence help in identifiability of overcomplete models? For simplicity, let the
order of the moment be $2m = 4$. The equations (10) and (11) reduce to

$$M_4^{(\infty)}(x) = (A \odot A) \, \mathrm{Diag}(\mathbb{E}[h]) \, (A \odot A)^\top, \tag{12}$$

$$M_4^{(1)}(x) = (A \otimes A) \, \mathbb{E}[(h \otimes h)(h \otimes h)^\top] \, (A \otimes A)^\top. \tag{13}$$

Note that for the single topic model in (12), the Khatri-Rao product matrix $A \odot A \in \mathbb{R}^{p^2 \times q}$ has
the same number of columns (i.e. the latent dimensionality) as the original matrix $A$, while
the number of rows (i.e. the observed dimensionality) is increased. Thus, the Khatri-Rao product




(a) Structure of an overcomplete matrix $A \in \mathbb{R}^{4 \times 5}$ having a perfect 2-gram matching.

(b) Structure of $A \odot A \in \mathbb{R}^{16 \times 5}$ having a perfect ($Y$-saturating) matching, highlighted by dashed
red edges.

(c) Structure of $A \otimes A \in \mathbb{R}^{16 \times 25}$. For simplicity, only a few edges and nodes are shown and the
dashed edges denote the bunch of edges connected to each node, not specifically shown.

Figure 4: An example of an overcomplete matrix $A$ and the matrices $A \odot A$ and $A \otimes A$. The corresponding
bipartite graphs encode the sparsity pattern of each of the matrices. $A \odot A$ expands the effect of hidden
variables to second order observed variables, which is crucial for overcomplete identifiability, while in
$A \otimes A$, the orders of both the hidden and observed variables are increased.


“expands”  the  effect  of  hidden  variables  to  higher  order  observed  variables,  which  is  the  key  towards 
identifying  overcomplete  models.  In  other  words,  the  original  overcomplete  representation  becomes 
determined  due  to  the  ‘expansion  effect’  of  the  Khatri-Rao  product  structure  of  the  higher  order 
observed  moments. 

On the other hand, in the bag-of-words admixture model in (13), this interesting ‘expansion property’
does not occur, and we have the Kronecker product $A \otimes A \in \mathbb{R}^{p^2 \times q^2}$ in place of the Khatri-Rao
products. The Kronecker product operation increases both the number of columns (i.e. latent
dimensionality) and the number of rows (i.e. observed dimensionality), which implies that higher
order moments do not help in identifying overcomplete models.

An example is provided in Figure 4, which helps to see how the matrices $A \odot A$ and $A \otimes A$ behave
differently in terms of mapping topics to word tuples.

Note that for the $n$-persistent model with $n = 2$, the 4th order moment reduces to

$$M_4^{(2)}(x) = (A \odot A) \, \mathbb{E}[h h^\top] \, (A \odot A)^\top. \tag{14}$$

Contrasting the above equation with (12) and (13), we find that the 2-persistent model retains the
desirable property of possessing Khatri-Rao products, while being more general than the form for the
single topic model in (12). This key property enables us to establish identifiability of topic models
with finite persistence levels.
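
The dimension counting behind this argument can be checked numerically. The following is a minimal sketch, assuming a dense random matrix as a stand-in for a generic topic-word matrix $A$ (the helper name `khatri_rao` and the sizes $p = 4$, $q = 5$ are illustrative choices, matching the sizes used in Figure 4):

```python
import numpy as np

def khatri_rao(A, B):
    """Column-wise Kronecker product: column j is kron(A[:, j], B[:, j])."""
    p1, q = A.shape
    p2, q2 = B.shape
    assert q == q2, "both factors need the same number of columns"
    return np.einsum("ij,kj->ikj", A, B).reshape(p1 * p2, q)

rng = np.random.default_rng(0)
p, q = 4, 5                       # overcomplete: more topics than words
A = rng.standard_normal((p, q))   # hypothetical generic stand-in for A

KR = khatri_rao(A, A)             # structure appearing in (12) and (14)
KN = np.kron(A, A)                # structure appearing in (13)

# Khatri-Rao: rows grow from p to p^2, columns stay at q.
print(KR.shape, np.linalg.matrix_rank(KR))
# Kronecker: rows grow to p^2 but columns grow to q^2 as well.
print(KN.shape, np.linalg.matrix_rank(KN))
```

In this sketch the Khatri-Rao product is a tall matrix that can have full column rank even though $A$ itself cannot, whereas the Kronecker product stays wide, so its columns remain linearly dependent.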

4.2  Tensor  algebra  of  the  model 

In Section 4.1, we provided a representation of the moments in matrix form. We now
provide the equivalent tensor representation of the moments. The tensor representation is more
compact and transparent, and allows us to compare the topic models under different levels of
persistence. We compare the derived tensor form with the well-known Tucker and CP decompositions.
We first introduce some tensor notations and definitions.


4.2.1  Tensor  notations  and  definitions 

A real-valued order-$n$ tensor $A \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i} := \mathbb{R}^{p_1 \times \cdots \times p_n}$ is an $n$-dimensional array
$A(1 : p_1, \ldots, 1 : p_n)$, where the $i$-th mode is indexed from 1 to $p_i$. In this paper, we restrict ourselves to the case
that $p_1 = \cdots = p_n = p$, and simply write $A \in \bigotimes^n \mathbb{R}^p$. A fiber of a tensor $A$ is a vector obtained by
fixing all indices of $A$ except one, e.g., for $A \in \bigotimes^4 \mathbb{R}^3$, the vector $f = A(2, 1 : 3, 3, 1)$ is a fiber.

For a vector $u \in \mathbb{R}^p$, $\mathrm{Diag}_n(u) \in \bigotimes^n \mathbb{R}^p$ is the $n$-th order diagonal tensor with vector $u$ on its
diagonal. The tensor $A \in \bigotimes^n \mathbb{R}^p$ is stacked as a vector $a \in \mathbb{R}^{p^n}$ by the $\mathrm{vec}(\cdot)$ operator, defined
as

$$a = \mathrm{vec}(A) \ \Leftrightarrow \ a\big((i_1 - 1)p^{n-1} + (i_2 - 1)p^{n-2} + \cdots + (i_{n-1} - 1)p + i_n\big) = A(i_1, i_2, \ldots, i_n).$$

The inverse of the $a = \mathrm{vec}(A)$ operation is denoted by $A = \mathrm{ten}(a)$.

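For vectors $a_i \in \mathbb{R}^{p_i}$, $i \in [n]$, the tensor outer product operator “$\circ$” is defined as

$$A = a_1 \circ a_2 \circ \cdots \circ a_n \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i} \ \Leftrightarrow \ A(i_1, i_2, \ldots, i_n) := a_1(i_1) a_2(i_2) \cdots a_n(i_n). \tag{15}$$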

The above generated tensor is a rank-1 tensor. The tensor rank is the minimal number of rank-1
tensors into which a tensor can be decomposed. This type of rank is called CP (Candecomp/Parafac)
tensor  rank  in  the  literature  El- 

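According to the above definitions, for any set of vectors $a_i \in \mathbb{R}^{p_i}$, $i \in [n]$, we have the following pair
of equalities:

$$\mathrm{vec}(a_1 \circ a_2 \circ \cdots \circ a_n) = a_1 \otimes a_2 \otimes \cdots \otimes a_n,$$
$$\mathrm{ten}(a_1 \otimes a_2 \otimes \cdots \otimes a_n) = a_1 \circ a_2 \circ \cdots \circ a_n.$$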

For any vector $a \in \mathbb{R}^p$, the power notations are also defined as

$$a^{\otimes n} := \underbrace{a \otimes a \otimes \cdots \otimes a}_{n \text{ times}} \in \mathbb{R}^{p^n},$$

$$a^{\circ n} := \underbrace{a \circ a \circ \cdots \circ a}_{n \text{ times}} \in \bigotimes^n \mathbb{R}^p.$$

The second power is usually called the $n$-th order tensor power of vector $a$.
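
The relation between $\mathrm{vec}(\cdot)$, $\mathrm{ten}(\cdot)$, the outer product and the Kronecker product can be verified on a small example. The sketch below assumes NumPy's row-major reshaping, which matches the index formula above (last index varying fastest); the function names `vec` and `ten` simply mirror the notation and are not taken from any library:

```python
import numpy as np

def vec(T):
    """Stack a tensor into a vector with the last index varying fastest."""
    return T.reshape(-1)

def ten(a, p, n):
    """Inverse of vec for an order-n tensor whose modes all have size p."""
    return a.reshape((p,) * n)

p, n = 3, 3
rng = np.random.default_rng(1)
a1, a2, a3 = rng.standard_normal((3, p))

outer = np.einsum("i,j,k->ijk", a1, a2, a3)   # a1 o a2 o a3, a rank-1 tensor
kron = np.kron(a1, np.kron(a2, a3))           # Kronecker product of the same vectors

assert np.allclose(vec(outer), kron)
assert np.allclose(ten(kron, p, n), outer)

# Index formula with 1-based indices: position (i1-1)p^2 + (i2-1)p + i3.
i1, i2, i3 = 2, 1, 3
pos = (i1 - 1) * p**2 + (i2 - 1) * p + i3
assert np.isclose(vec(outer)[pos - 1], outer[i1 - 1, i2 - 1, i3 - 1])
```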

Finally,  the  Tucker  and  CP  (Candecomp/Parafac)  representations  are  defined  as  follows  |10l4l|. 
Definition 7 (Tucker representation). Given a core tensor $S \in \bigotimes_{i=1}^{n} \mathbb{R}^{r_i}$ and inverse factors
$U_i \in \mathbb{R}^{p_i \times r_i}$, $i \in [n]$, the Tucker representation of the $n$-th order tensor $A \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i}$ is

$$A = \sum_{i_1=1}^{r_1} \sum_{i_2=1}^{r_2} \cdots \sum_{i_n=1}^{r_n} S(i_1, i_2, \ldots, i_n) \, U_1(:, i_1) \circ U_2(:, i_2) \circ \cdots \circ U_n(:, i_n) =: [[S; U_1, U_2, \ldots, U_n]], \tag{16}$$

where $U_j(:, i_j)$ denotes the $i_j$-th column of matrix $U_j$. The tensor $S$ is referred to as the core tensor.

Definition 8 (CP representation). Given $\lambda \in \mathbb{R}^r$ and $U_i \in \mathbb{R}^{p_i \times r}$, $i \in [n]$, the CP representation of the
$n$-th order tensor $A \in \bigotimes_{i=1}^{n} \mathbb{R}^{p_i}$ is

$$A = \sum_{i=1}^{r} \lambda_i \, U_1(:, i) \circ U_2(:, i) \circ \cdots \circ U_n(:, i) =: [[\mathrm{Diag}_n(\lambda); U_1, U_2, \ldots, U_n]], \tag{17}$$

where $U_j(:, i)$ denotes the $i$-th column of matrix $U_j$.

Note that the CP representation is a special case of the Tucker representation when the core tensor
$S$ is square and diagonal.
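
To make the bracket notation concrete, the following sketch evaluates (16) and (17) for third-order tensors and checks that the CP form coincides with a Tucker form whose core is diagonal. The function names `tucker`, `cp` and `diag_tensor` and the sizes are our own illustrative choices:

```python
import numpy as np

def tucker(S, factors):
    """[[S; U_1, U_2, U_3]] for an order-3 core S and factors U_i of size p_i x r_i."""
    U1, U2, U3 = factors
    return np.einsum("abc,ia,jb,kc->ijk", S, U1, U2, U3)

def cp(lam, factors):
    """sum_i lam_i U_1(:, i) o U_2(:, i) o U_3(:, i)."""
    U1, U2, U3 = factors
    return np.einsum("a,ia,ja,ka->ijk", lam, U1, U2, U3)

def diag_tensor(lam, n=3):
    """Diag_n(lam): order-n tensor with the vector lam on its diagonal."""
    r = len(lam)
    D = np.zeros((r,) * n)
    idx = np.arange(r)
    D[(idx,) * n] = lam
    return D

rng = np.random.default_rng(5)
p, r = 4, 3
Us = [rng.standard_normal((p, r)) for _ in range(3)]
lam = rng.standard_normal(r)

# CP is the Tucker form with a square, diagonal core tensor.
assert np.allclose(cp(lam, Us), tucker(diag_tensor(lam), Us))
```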


4.2.2  Tensor  representation  of  moments  under  topic  model 

We  now  provide  a  tensor  representation  of  the  moments. 

For the $n$-persistent topic model, the $2m$-th observed moment is denoted by $T_{2m}^{(n)}(x)$, which is the
tensor form of the moment matrix $M_{2m}^{(n)}(x)$, characterized in Lemma 1. It is given by

$$T_{2m}^{(n)}(x)(i_1, i_2, \ldots, i_{2m}) := \mathbb{E}[x_1(i_1) x_2(i_2) \cdots x_{2m}(i_{2m})], \quad i_1, i_2, \ldots, i_{2m} \in [p], \tag{18}$$

where $T_{2m}^{(n)}(x) \in \bigotimes^{2m} \mathbb{R}^p$.

This tensor is characterized in the following lemma, which is proved in Appendix A.2.

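Lemma 2 ($n$-persistent topic model moment characterization in tensor form). The $(2m)$-th order
moment of words, defined in equation (18), for the $n$-persistent topic model is characterized as$^{10}$:

• If $m = rn$ for some integer $r \ge 1$, then

$$T_{2m}^{(n)}(x) = \sum_{i_1=1}^{q} \sum_{i_2=1}^{q} \cdots \sum_{i_{2r}=1}^{q} \mathbb{E}[h_{i_1} h_{i_2} \cdots h_{i_{2r}}] \, a_{i_1}^{\circ n} \circ a_{i_2}^{\circ n} \circ \cdots \circ a_{i_{2r}}^{\circ n} = [[S_r; \underbrace{A, A, \ldots, A}_{2m \text{ times}}]], \tag{19}$$

where $S_r \in \bigotimes^{2m} \mathbb{R}^q$ is the core tensor in the above Tucker representation with the sparsity
pattern

$$S_r(\mathbf{i}) = \begin{cases} \big(M_{2r}(h)\big)_{(i_n, i_{2n}, \ldots, i_{rn}),\, (i_{(r+1)n}, i_{(r+2)n}, \ldots, i_{2rn})}, & i_1 = i_2 = \cdots = i_n,\ i_{n+1} = i_{n+2} = \cdots = i_{2n},\ \ldots \\ 0, & \text{o.w.}, \end{cases}$$

where $\mathbf{i} := (i_1, i_2, \ldots, i_{2m})$.

• If $n \ge 2m$, then

$$T_{2m}^{(n)}(x) = \sum_{i \in [q]} \mathbb{E}[h_i] \, a_i^{\circ 2m} = [[\mathrm{Diag}_{2m}(\mathbb{E}[h]); \underbrace{A, A, \ldots, A}_{2m \text{ times}}]]. \tag{20}$$

10 The other cases not covered in Lemma 2 are deferred to Appendix A.2. See Remark 12.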

The tensor representation in (19) is a specific type of tensor decomposition which is a special case of
the Tucker representation (since $S_r$ is not fully dense), but more general than the CP representation.
The tensor representation in (20) has a CP form.
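
As a sanity check on (19), the sketch below takes the case $n = 2$, $r = 1$ (so $2m = 4$), builds the tensor both from the sum and from an explicitly sparse core, and compares its matricization with the matrix form (14). The sampled `H` is only a hypothetical stand-in for draws of the hidden vector $h$, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 3, 4
A = rng.random((p, q))
H = rng.random((1000, q))          # hypothetical samples of h
M2 = H.T @ H / len(H)              # empirical E[h h^T]

# (19) with n = 2, r = 1: sum_{i1,i2} E[h_i1 h_i2] a_i1 o a_i1 o a_i2 o a_i2
T4 = np.einsum("ab,ia,ja,kb,lb->ijkl", M2, A, A, A, A)

# The same tensor as a Tucker form [[S; A, A, A, A]] with a sparse core S:
# S(i1, i2, i3, i4) is non-zero only when i1 = i2 and i3 = i4.
S = np.zeros((q,) * 4)
for a in range(q):
    for b in range(q):
        S[a, a, b, b] = M2[a, b]
T4_tucker = np.einsum("abcd,ia,jb,kc,ld->ijkl", S, A, A, A, A)

KR = np.einsum("ia,ja->ija", A, A).reshape(p * p, q)   # Khatri-Rao product of A with itself
M4 = KR @ M2 @ KR.T                                    # matrix form (14)

assert np.allclose(T4, T4_tucker)
assert np.allclose(T4.reshape(p * p, p * p), M4)
print(np.count_nonzero(S), "of", S.size, "core entries are non-zero")
```

The core has $q^4$ entries but only $q^2$ of them can be non-zero here, which is the structured sparsity referred to above.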


Comparison  with  single  topic  model  and  bag-of-words  admixture  model 


We now provide the tensor form for the special cases of the single topic model and the bag-of-words admixture
model. In order to have a fair comparison, the number of observed variables is fixed to $2m$ and the
persistence level is varied.

CP representation of the single topic model: The $(2m)$-th order moment of the words for
the single topic model (infinite-persistent topic model) is provided in equation (20) as

$$T_{2m}^{(\infty)}(x) = \sum_{i \in [q]} \mathbb{E}[h_i] \, a_i^{\circ 2m} = [[\mathrm{Diag}_{2m}(\mathbb{E}[h]); \underbrace{A, A, \ldots, A}_{2m \text{ times}}]]. \tag{21}$$

This representation is the symmetric CP representation$^{11}$ of $T_{2m}^{(\infty)}(x)$.

Tucker representation of the bag-of-words admixture model: From Lemma 2, the tensor
form of the $(2m)$-th order moment of observed variables $x_l$, $l \in [2m]$, for the bag-of-words admixture
model (1-persistent topic model) is given by

$$T_{2m}^{(1)}(x) = \sum_{i_1=1}^{q} \sum_{i_2=1}^{q} \cdots \sum_{i_{2m}=1}^{q} \mathbb{E}[h_{i_1} h_{i_2} \cdots h_{i_{2m}}] \, a_{i_1} \circ a_{i_2} \circ \cdots \circ a_{i_{2m}} = [[\mathbb{E}[h^{\circ (2m)}]; \underbrace{A, A, \ldots, A}_{2m \text{ times}}]]. \tag{22}$$

This representation is the Tucker representation (decomposition) of $T_{2m}^{(1)}(x)$, where the core tensor
$S = \mathbb{E}[h^{\circ (2m)}]$ is the tensor form of the $(2m)$-th order hidden moment $M_{2m}(h)$, defined in equation
(3), and the inverse factors correspond to the population structure $A$.

11 In Appendix C, we provide a more detailed comparison between our approach and some of the previous
identifiability results for the (overcomplete) CP decomposition.

Figure 5: Hierarchy among the proposed conditions and results.


Comparing the tensor forms for the $n$-persistent topic model (19), single topic model (21), and
bag-of-words admixture model (22), we find that all of them involve Tucker decompositions, where the
inverse factors correspond to the topic-word matrix $A$, and the only difference is in the sparsity
level of the core tensor $S$. For the bag-of-words model, with $n = 1$, the core tensor is fully dense in
general, while for the single topic model, with $n \to \infty$, the core tensor is diagonal, which reduces
the representation to the CP decomposition. For a general topic model with persistence level $n$, the
core tensor is in between these two extremes and has structured sparsity. This sparsity property of
the core tensor is crucial towards establishing identifiability in the overcomplete regime. The bag-of-words model
is not identifiable in the overcomplete regime since the core tensor is fully dense in this case, while
an overcomplete $n$-persistent topic model can be identified under certain constraints provided in
Section 3, since the core tensor has structured sparsity and symmetry.
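
As a rough illustration of these three sparsity levels, consider the fourth-order case $2m = 4$. This is a worked count under the assumption that the hidden moments are otherwise generic, so that no further entries vanish. The core tensor $S \in \bigotimes^4 \mathbb{R}^q$ has $q^4$ entries in total, of which the following can be non-zero:

```latex
\begin{align*}
n = 1 &: \quad S = \mathbb{E}[h^{\circ 4}],
  && \text{up to } q^4 \text{ non-zero entries}, \\
n = 2 &: \quad S(i_1, i_2, i_3, i_4) =
  \begin{cases} \mathbb{E}[h_{i_1} h_{i_3}], & i_1 = i_2,\ i_3 = i_4, \\ 0, & \text{o.w.}, \end{cases}
  && \text{up to } q^2 \text{ non-zero entries}, \\
n \to \infty &: \quad S = \mathrm{Diag}_4(\mathbb{E}[h]),
  && \text{up to } q \text{ non-zero entries}.
\end{align*}
```

For instance, with $q = 10$ topics these counts are $10^4$, $10^2$ and $10$, respectively.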


5  Proof  Techniques  and  Auxiliary  Results 

The main identifiability results are given in Theorems 1 and 2 for deterministic and random cases of
topic-word graph structures. In this section, we provide a proof sketch of these results, and then we
propose auxiliary results on the existence of perfect $n$-gram matching for random bipartite graphs
and a lower bound on the Kruskal rank of random matrices.

5.1  Proof  sketch 

Summary of relationships among different conditions: To summarize, there exists a
hierarchy among the proposed conditions as follows. See Figure 5. First, in the random analysis, the
size and the degree conditions 4 and 5 are sufficient for satisfying the perfect $n$-gram matching and
the krank conditions 2 and 3, shown by Theorems 3 and 4. Then, these conditions 2 and 3 ensure
that the rank and the expansion conditions 6 and 7 hold, shown by the lemma in Appendix A.3. And finally, these
conditions 6 and 7 together with non-degeneracy condition 1 conclude the primary identifiability
result in Theorem 5. Note that the genericity of $A$ is also required for these results to hold.

Primary deterministic analysis in Theorem 5: The deterministic analysis is primarily based
on conditions on the $n$-gram matrix $A^{\odot n}$; but since these conditions are opaque (mainly the expansion
condition on $A^{\odot n}$, provided in condition 7), this analysis is related to conditions on the matrix $A$ itself.
See Theorem 5 in Appendix A.1 for the identifiability result based on $A^{\odot n}$. We briefly discuss
it below for the case when $2n$ words are available under the $n$-persistent topic model.
From equation (8), the $(2n)$-th order moment of the observed variables under the $n$-persistent topic
model can be written as

$$M_{2n}^{(n)}(x) = (A^{\odot n}) \, \mathbb{E}[h h^\top] \, (A^{\odot n})^\top. \tag{23}$$

The question is whether we can recover $A$, given $M_{2n}^{(n)}(x)$. Obviously, the matrix $A$ is not
identifiable without any further conditions. First, non-degeneracy and rank conditions (conditions
1 and 6) are required. Assuming these two conditions, we have from (23) that

$$\mathrm{Col}\big(M_{2n}^{(n)}(x)\big) = \mathrm{Col}(A^{\odot n}).$$

Therefore, the problem of recovering $A$ from $M_{2n}^{(n)}(x)$ reduces to finding $A^{\odot n}$ in $\mathrm{Col}(A^{\odot n})$.
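
The column space identity can be illustrated numerically. The sketch below assumes $n = 2$, a dense random $A$ (so that $A^{\odot 2}$ has full column rank), and Dirichlet samples as a hypothetical non-degenerate distribution for $h$; none of these choices come from the analysis itself:

```python
import numpy as np

rng = np.random.default_rng(3)
p, q, n = 4, 5, 2
A = rng.standard_normal((p, q))
B = np.einsum("ia,ja->ija", A, A).reshape(p**n, q)   # n-gram matrix for n = 2

H = rng.dirichlet(np.ones(q), size=2000)   # hypothetical topic proportions h
Ehh = H.T @ H / len(H)                     # empirical E[h h^T], a q x q matrix

M = B @ Ehh @ B.T                          # moment matrix of the form (23)

rank = np.linalg.matrix_rank
print(rank(Ehh), rank(B), rank(M))

# Appending the columns of B to M does not increase the rank,
# so Col(M) and Col(B) coincide.
assert rank(np.hstack([M, B])) == rank(M) == q
```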

Then, we show that under the following expansion condition on $A^{\odot n}$ and the genericity property,
the matrix $A$ is identifiable from $\mathrm{Col}(A^{\odot n})$. The expansion condition (refer to condition 7 for a more
detailed statement) imposes the following property on the bipartite graph $G(V_h, V_o^{(n)}; A^{\odot n})$$^{12}$:

$$\big|N_{A^{\odot n}_{\mathrm{Rest.}}}(S)\big| \ge |S| + d_{\max}(A^{\odot n}), \quad \forall S \subseteq V_h, \ |S| > \mathrm{krank}(A), \tag{24}$$

where $d_{\max}(A^{\odot n})$ is the maximum node degree in set $V_h$, and the restricted version of the $n$-gram
matrix, denoted by $A^{\odot n}_{\mathrm{Rest.}}$, is obtained by removing its redundant (identical) rows (see Definition 9).
The identifiability claim is proved by showing that the columns of $A^{\odot n}$ are the sparsest and
rank-1 vectors (in the tensor form) in $\mathrm{Col}(A^{\odot n})$ under the expansion condition in (24) and genericity
conditions. Note that since we only require expansion on sets larger than the Kruskal rank, the
expansion condition (24) is a more relaxed condition compared to the expansion condition proposed
in  [71(43]  for  identifiability  in  the  undercomplete  regime.  For  a  more  detailed  comparison,  refer  to 
Remark 11 in Appendix A.1.

Deterministic analysis in Theorem 1: Expansion and rank conditions in Theorem 5 are
imposed on the $n$-gram matrix $A^{\odot n}$. According to the generalized matching notions, defined in
Section 3.1, sufficient combinatorial conditions on matrix $A$ (conditions 2 and 3) are introduced
which ensure that the expansion and rank conditions on $A^{\odot n}$ are satisfied. The following lemma
is employed to establish these results, where we state an interesting property which relates the
existence of a perfect matching in $A^{\odot n}$ to the existence of a perfect $n$-gram matching in $A$.

Lemma 3. If $G(Y, X; A)$ has a perfect $n$-gram matching, then $G(Y, X^{(n)}; A^{\odot n})$ has a perfect
matching. In the other direction, if $G(Y, X^{(n)}; A^{\odot n})$ has a perfect matching $M^{\odot n}$, then $G(Y, X; A)$
has a perfect $n$-gram matching under the following condition on $M^{\odot n}$. All the matching edges
$(j, (i_1, \ldots, i_n)) \in M^{\odot n}$ should satisfy $i_1 \ne i_2 \ne \cdots \ne i_n$ for all $j \in Y$. In words, the matching edges
should be connected to nodes in $X^{(n)}$ which are indexed by tuples of distinct indices.

See Appendix A.4 for a proof. Using this lemma, condition 2 implies that $G(Y, X^{(n)}; A^{\odot n})$ has
a perfect matching. Then, it is straightforward to argue that the expansion and rank conditions
on $A^{\odot n}$ are satisfied, which is shown in Appendix A.3. This leads to the generic
identifiability result stated in Theorem 1.

12 $V_o^{(n)}$ denotes all ordered $n$-tuples generated from the set $V_o := \{1, \ldots, p\}$, which indexes the rows of $A^{\odot n}$.

5.2 Analysis of Random Structures


The identifiability result for a random structured matrix $A$ is provided in Theorem 2. Sufficient size
and degree conditions 4 and 5 on the random matrix $A$ are proposed such that the deterministic
combinatorial conditions 2 and 3 on $A$ are satisfied. The details of these auxiliary results are
provided in the following two subsections. In Section 5.2.1, it is proved in Theorem 3 that
a random bipartite graph satisfying reasonable size and degree constraints has a perfect $n$-gram
matching (condition 2), whp. Then, a lower bound on the Kruskal rank of a random matrix $A$
under size and degree constraints is provided in Theorem 4 in Section 5.2.2, which implies the
krank condition 3. Intuitions on why such size and degree conditions are required are mentioned
in Section 3.2, where these conditions are proposed.


5.2.1  Existence  of  perfect  n-gram  matching  for  random  bipartite  graphs 

We show in the following theorem that a random bipartite graph satisfying reasonable size and
degree constraints, proposed earlier in conditions 4 and 5, has a perfect $n$-gram matching whp.

Theorem 3 (Existence of perfect $n$-gram matching for random bipartite graphs). Consider a
random bipartite graph $G(Y, X; E)$ with $|Y| = q$ nodes on the left side and $|X| = p$ nodes on the right
side, where each node $i \in Y$ is randomly connected to $d_i$ different nodes in $X$. Let $d_{\min} := \min_{i \in Y} d_i$.
Assume that it satisfies the size condition $q \le \left(c \frac{p}{n}\right)^n$ (condition 4) for some constant $0 < c < 1$
and the degree condition $d_{\min} \ge \max\{1 + \beta \log p, \alpha \log p\}$ for some constants $\beta > \frac{n-1}{\log 1/c}$, $\alpha >
\max\{2n^2(\beta \log \frac{1}{c} + 1), 2\beta n\}$ (lower bound in condition 5). Then, there exists a perfect ($Y$-saturating)
$n$-gram matching in the random bipartite graph $G(Y, X; E)$, with probability at least $1 - \gamma_1 p^{-\beta'}$ for
constants $\beta' > 0$ and $\gamma_1 > 0$ specified in the proof.

Note that the sufficient size bound $q = O(p^n)$ in the above theorem is also necessary (see Remark
3), and is therefore tight.

Remark  10  (Insufficiency  of  the  union  bound  argument).  It  is  easier  to  exploit  the  union  bound 
arguments  to  propose  random  bipartite  graphs  which  have  a  perfect  n-gram  matching  whp.  It  is 
proved  in  Avvendix  \B.1\  that  if  d  >  n  and  the  size  constraint  |K|  =  0(|A|'2_5)  for  some  5  >  0 
is satisfied, then whp, the random bipartite graph has a perfect $n$-gram matching. Comparing
this result with ours in Theorem 3, our approach has a better size scaling while the union bound
approach has a better degree scaling. The size scaling limitation in the union bound argument makes
it unattractive. In order to identify the population structure $A$ in the overcomplete regime where
$|Y| = O(|X|^n)$, we need access to at least the $(4n)$-th order moment under the union bound argument,
while only the $(2n)$-th order moment is required under our argument.


5.2.2  Lower  bound  on  the  Kruskal  rank  of  random  matrices 

In the following theorem, a lower bound on the Kruskal rank of a random matrix $A$ under dimension
and degree constraints is provided, which is proved in Appendix B.1.

Theorem 4 (Lower bound on the Kruskal rank of random matrices). Consider a random matrix
$A \in \mathbb{R}^{p \times q}$, where for any $i \in [q]$, there are $d_i$ random non-zero entries in column $i$. Let
$d_{\min} := \min_{i \in [q]} d_i$. Assume that it satisfies the size condition $q \le \left(c \frac{p}{n}\right)^n$ (condition 4) for some
constant $0 < c < 1$ and the degree condition $d_{\min} \ge 1 + \beta \log p$ for some constant $\beta > \frac{n-1}{\log 1/c}$ (lower
bound in condition 5), and in addition $A$ is generic. Then, $\mathrm{krank}(A) \ge cp$, with probability at least
$1 - \gamma_2 p^{-\beta'}$ for constants $\beta' > 0$ and $\gamma_2 > 0$ specified in the proof.
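
For small matrices, the Kruskal rank can be computed by brute force, which gives a way to experiment with the statement above. The sketch below is only an illustration: the function names, the sizes and the Gaussian non-zero values are assumptions, and the exhaustive search is exponential in $q$, so it is not meant for realistic dimensions.

```python
import numpy as np
from itertools import combinations

def kruskal_rank(A, tol=1e-10):
    """Largest k such that every set of k columns of A is linearly independent."""
    p, q = A.shape
    k = 0
    for size in range(1, min(p, q) + 1):
        if all(np.linalg.matrix_rank(A[:, list(c)], tol=tol) == size
               for c in combinations(range(q), size)):
            k = size
        else:
            break
    return k

def random_sparse_matrix(p, q, d, rng):
    """Each column gets d non-zero entries in random rows, with generic values."""
    A = np.zeros((p, q))
    for j in range(q):
        rows = rng.choice(p, size=d, replace=False)
        A[rows, j] = rng.standard_normal(d)
    return A

rng = np.random.default_rng(6)
A = random_sparse_matrix(p=8, q=12, d=3, rng=rng)   # overcomplete and sparse
print("krank(A) =", kruskal_rank(A))
```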

Appendix

A Proof of Deterministic Identifiability Result (Theorem 1)

First, we show the identifiability result under an alternative set of conditions on the $n$-gram matrix
$A^{\odot n}$, and then we show that the conditions of Theorem 1 are sufficient for these conditions to
hold.


A.1 Deterministic analysis based on $A^{\odot n}$


In this section, the deterministic identifiability result based on conditions on the $n$-gram matrix
$A^{\odot n}$ is provided.


In the $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$, redundant rows exist. If some row of $A^{\odot n}$ is indexed by the
$n$-tuple $(i_1, \ldots, i_n) \in [p]^n$, then another row indexed by any permutation of the tuple $(i_1, \ldots, i_n)$
has the same entries. Therefore, the number of distinct rows of $A^{\odot n}$ is at most $\binom{p+n-1}{n}$. In the
following definition, we define a non-redundant version of the $n$-gram matrix which is restricted to the
(potentially) distinct rows.

Definition 9 (Restricted $n$-gram matrix). For any matrix $A \in \mathbb{R}^{p \times q}$, the restricted $n$-gram matrix
$A^{\odot n}_{\mathrm{Rest.}} \in \mathbb{R}^{s \times q}$, $s = \binom{p+n-1}{n}$, is defined as the restricted version of the $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$,
where the redundant rows of $A^{\odot n}$ are removed, as explained above.
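
Definition 9 can be made concrete with a short sketch. The row selection below keeps one representative per multiset of indices (the sorted tuples $i_1 \le \cdots \le i_n$); this particular choice of representative, as well as the function names, are our own assumptions for illustration:

```python
import numpy as np
from itertools import combinations_with_replacement, permutations
from math import comb

def ngram_matrix(A, n):
    """Full n-gram matrix: column j is the n-fold Kronecker power of A[:, j]."""
    cols = []
    for a in A.T:
        b = a
        for _ in range(n - 1):
            b = np.kron(b, a)
        cols.append(b)
    return np.stack(cols, axis=1)

def restricted_ngram_matrix(A, n):
    """Keep one row per multiset of indices, i.e. sorted tuples i1 <= ... <= in."""
    p = A.shape[0]
    keep = [np.ravel_multi_index(t, (p,) * n)
            for t in combinations_with_replacement(range(p), n)]
    return ngram_matrix(A, n)[keep, :]

A = np.arange(1.0, 7.0).reshape(3, 2)      # small illustrative matrix, p = 3, q = 2
full = ngram_matrix(A, 3)                  # p^n = 27 rows
rest = restricted_ngram_matrix(A, 3)       # C(p + n - 1, n) = 10 rows

# Rows indexed by permutations of the same tuple are identical.
ref = full[np.ravel_multi_index((0, 1, 2), (3, 3, 3))]
for perm in permutations((0, 1, 2)):
    assert np.allclose(full[np.ravel_multi_index(perm, (3, 3, 3))], ref)
print(full.shape, rest.shape)
```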

Condition 6 (Rank condition). The $n$-gram matrix $A^{\odot n}$ is full column rank.

Condition 7 (Graph expansion). Let $G(V_h, V_o^{(n)}; A^{\odot n})$ denote the bipartite graph with vertex sets
$V_h$ corresponding to the hidden variables (indexing the columns of $A^{\odot n}$) and $V_o^{(n)}$ corresponding to
the $n$-th order observed variables (indexing the rows of $A^{\odot n}$) and edge matrix $A^{\odot n} \in \mathbb{R}^{|V_o^{(n)}| \times |V_h|}$.
The bipartite graph $G(V_h, V_o^{(n)}; A^{\odot n})$ satisfies the following expansion property on the restricted
version specified by $A^{\odot n}_{\mathrm{Rest.}}$:

$$\big|N_{A^{\odot n}_{\mathrm{Rest.}}}(S)\big| \ge |S| + d_{\max}(A^{\odot n}), \quad \forall S \subseteq V_h, \ |S| > \mathrm{krank}(A), \tag{25}$$

where $d_{\max}(A^{\odot n})$ is the maximum node degree in set $V_h$.

Remark  11.  The  expansion  condition  for  the  bag-of-words  admixture  model  is  provided  in 
introduced  in  E-  The  proposed  expansion  condition  in  (1251)  is  inherited  from  (J2)) ,  with  two  major 
modifications.  First,  the  condition  is  appropriately  generalized  for  our  model  which  involves  a  graph 
with  edges  specified  by  the  n-gram  matrix,  A&n,  as  stated  in  (ESD-  Second,  the  expansion  property 
®,  proposed  in  E,  needs  to  be  satisfied  for  all  subsets  S  with  size  |S|  >  2,  which  is  a  stricter 
condition  than  the  one  proposed  here  in  (1251).  since  we  can  have  krank(Tl)  2. 

The  deterministic  identifiability  result  based  on  the  conditions  on  A&n,  is  stated  in  the  following 
theorem  for  n  >  2,  while  n  =  1  case  is  addressed  in  Remarks  [4]  and  [TlJ  The  identifiability  result 
relies  on  access  to  the  (2n)-th  order  moment  of  observed  variables  xi,l  £  [2n],  defined  in  equation 
d2J)  as 


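$$M_{2n}^{(n)}(x) := \mathbb{E}\left[(x_1 \otimes x_2 \otimes \cdots \otimes x_n)(x_{n+1} \otimes x_{n+2} \otimes \cdots \otimes x_{2n})^\top\right] \in \mathbb{R}^{p^n \times p^n}.$$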

Theorem 5 (Generic identifiability under deterministic conditions on $A^{\odot n}$). Let $M_{2n}^{(n)}(x)$, as defined
above, be the $(2n)$-th order moment of the $n$-persistent topic model described earlier.
If the model satisfies conditions 1, 6 and 7, then, for any $n \ge 2$, all the columns of the population
structure $A$ are generically identifiable from $M_{2n}^{(n)}(x)$.

Proof: Define $B := A^{\odot n} \in \mathbb{R}^{p^n \times q}$. Then, the moment characterized in equation (23) can be
written as $M_{2n}^{(n)}(x) = B \, \mathbb{E}[h h^\top] B^\top$. Since both matrices $\mathbb{E}[h h^\top]$ and $B$ have full column rank
(from conditions 1 and 6), the rank of $B \, \mathbb{E}[h h^\top] B^\top$ is $q$, where $q = O(p^n)$, and furthermore
$\mathrm{Col}(B \, \mathbb{E}[h h^\top] B^\top) = \mathrm{Col}(B)$. Let $U := \{u_1, \ldots, u_q\} \subset \mathbb{R}^{p^n}$ be any basis of $\mathrm{Col}(B \, \mathbb{E}[h h^\top] B^\top)$
satisfying the following two properties:

1) The $u_i$'s have the smallest $\ell_0$ norms.

2) The $u_i$'s have the $q$ smallest (tensor) ranks in the $n$-th order tensor form, i.e., $U_i := \mathrm{ten}(u_i)$, $i \in [q]$,
have the $q$ smallest ranks.

Let the columns of matrix $B$ be $b_i$ for $i \in [q]$. Since all the $b_i$'s (which belong to $\mathrm{Col}(B \, \mathbb{E}[h h^\top] B^\top)$)
are rank-1 in the $n$-th order tensor form (since $\mathrm{ten}(b_i) = a_i^{\circ n}$) and the number of non-zero entries
in each of the $b_i$'s is at most $d_{\max}(B) = d_{\max}(A)^n$, we conclude that

$$\max_i \mathrm{Rank}(\mathrm{ten}(u_i)) = 1 \quad \text{and} \quad \max_i \|u_i\|_0 \le d_{\max}(B). \tag{26}$$

The above bounds are concluded from the fact that $b_i \in \mathrm{Col}(B \, \mathbb{E}[h h^\top] B^\top)$, $i \in [q]$, and therefore
the $\ell_0$ norm and the rank properties of the $b_i$'s are upper bounds for the corresponding properties of the
basis vectors $u_i$'s (according to the proposed conditions for the $u_i$'s).

Now, exploiting these observations and also the genericity of $A$ and the expansion condition 7, we
show that the basis vectors $u_i$'s are scaled columns of $B$. Since $u_i$, for $i \in [q]$, is a vector in the
column space of $B$, it can be represented as $u_i = B v_i$ for some vector $v_i \in \mathbb{R}^q$. Equivalently, for
any $i \in [q]$, $u_i = \sum_{j=1}^{q} v_i(j) b_j$, where $b_j = a_j^{\otimes n}$ is the $j$-th column of matrix $B$ and $v_i(j)$ is a scalar
which is the $j$-th entry of vector $v_i$. Then, the tensor form of $u_i$ can be written as

$$\mathrm{ten}(u_i) = \sum_{j=1}^{q} v_i(j) \, \mathrm{ten}(b_j) = \sum_{j=1}^{q} v_i(j) \, \mathrm{ten}(a_j^{\otimes n}) = \sum_{j=1}^{q} v_i(j) \, a_j^{\circ n} = [[\mathrm{Diag}_n(v_i); \underbrace{A, \ldots, A}_{n \text{ times}}]], \tag{27}$$

where the last equality is based on the notation defined in Definition 8. We define $\tilde{v}_i := [v_i(j)]_{j : v_i(j) \ne 0}$
as the vector which contains only the non-zero entries of $v_i$, i.e., $\tilde{v}_i$ is the restriction of vector $v_i$
to its support. Therefore, $\tilde{v}_i \in \mathbb{R}^r$, where $r := \|v_i\|_0$. Furthermore, the matrix $\tilde{A}_i := \{a_j : v_i(j) \ne
0\} \in \mathbb{R}^{p \times r}$ is defined as the restriction of $A$ to its columns corresponding to the support of $v_i$. Let
$(\tilde{a}_i)_j$ denote the $j$-th column of $\tilde{A}_i$. According to these definitions, equation (27) reduces to

$$\mathrm{ten}(u_i) = [[\mathrm{Diag}_n(\tilde{v}_i); \underbrace{\tilde{A}_i, \ldots, \tilde{A}_i}_{n \text{ times}}]] = \sum_{j=1}^{r} \tilde{v}_i(j) \, [(\tilde{a}_i)_j]^{\circ n}, \tag{28}$$

which is derived by removing the columns of $A$ corresponding to the zero entries in $v_i$.

Next, we rule out that $\|v_i\|_0 \ge 2$ under two cases ($2 \le \|v_i\|_0 \le \mathrm{krank}(A)$ and $\mathrm{krank}(A) < \|v_i\|_0 \le q$),
to conclude that the vectors $u_i$'s are scaled columns of $B$.


Case 1: $2 \le \|v_i\|_0 \le \mathrm{krank}(A)$. Here, the number of columns of $\tilde{A}_i \in \mathbb{R}^{p \times \|v_i\|_0}$ is less than or equal
to $\mathrm{krank}(A)$ and therefore it is full column rank. Since all the components of the CP representation
in equation (28) are full column rank$^{13}$, for any$^{14}$ $n \ge 2$, we have $\mathrm{Rank}(\mathrm{ten}(u_i)) = r = \|v_i\|_0 > 1$,
which contradicts the fact that $\max_i \mathrm{Rank}(\mathrm{ten}(u_i)) = 1$ in (26).
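
The rank argument of Case 1 can be observed directly for $n = 2$, where the tensor rank of $\mathrm{ten}(u_i)$ is the ordinary matrix rank. The sketch below assumes a dense random $A$ (so that $\mathrm{krank}(A) = p$ generically) and illustrative sizes; it combines $r$ columns of $B = A \odot A$ and reports the rank of the resulting $p \times p$ matrix:

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 4, 6
A = rng.standard_normal((p, q))          # generic dense stand-in, krank(A) = p = 4
B = np.einsum("ia,ja->ija", A, A).reshape(p * p, q)   # n-gram matrix for n = 2

def ten_rank(v):
    """Rank of ten(B v) for n = 2, where tensor rank equals matrix rank."""
    return np.linalg.matrix_rank((B @ v).reshape(p, p))

ranks = []
for r in range(1, p + 1):
    v = np.zeros(q)
    v[:r] = rng.standard_normal(r)       # support of size r <= krank(A)
    ranks.append(ten_rank(v))
    print("support size", r, "-> rank of ten(Bv):", ranks[-1])
```

Only vectors $v$ with a single non-zero entry yield a rank-1 tensor in this range, which is the contradiction exploited above.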


Case 2: $\mathrm{krank}(A) < \|v_i\|_0 \le q$. Here, we first restrict the $n$-gram matrix $B$ to distinct rows,
denoted by $B_{\mathrm{Rest.}}$, as defined in Definition 9. Let $u_i' = B_{\mathrm{Rest.}} v_i$. Since $u_i'$ is the restricted version
of $u_i$, we have

$$\|u_i\|_0 \ge \|u_i'\|_0 = \|B_{\mathrm{Rest.}} v_i\|_0 > \big|N_{B_{\mathrm{Rest.}}}(\mathrm{Supp}(v_i))\big| - |\mathrm{Supp}(v_i)| \ge d_{\max}(B),$$

where the second inequality is from Lemma 4 and the third inequality follows from the graph
expansion property (condition 7). This result contradicts the fact that $\max_i \|u_i\|_0 \le d_{\max}(B)$ in
(26).


From the above contradictions, $\|v_i\|_0 = 1$ and hence, the columns of $B := A^{\odot n}$ are the scaled versions
of the $u_i$'s. □

13Note  that  for  n  >  3,  this  full  rank  condition  can  be  relaxed  by  Kruskal’s  condition  for  uniqueness  of  CP  decompo¬ 
sition  m  and  its  generalization  to  higher  order  tensors  [2].  Precisely,  instead  of  saying  Rank(A)  =  krank(Ai)  =  r, 
it  is  only  required  to  have  krank(Ai)  >  (2 r  +  n  —  l)/n  to  argue  the  result  of  case  1.  This  only  improves  the  constants 
involved  in  the  final  result. 

14Note  that  for  n  =  1,  since  the  (tensor)  rank  of  any  vector  is  1,  this  analysis  does  not  work. 

The following lemma is useful in the proof of Theorem 5. The result proposed in this lemma is
similar to the parameter genericity condition in [7], but generalized for the $n$-gram matrix $A^{\odot n}$.
The lemma is proved along the lines of the proof of Remark 2.2 in [7].

Lemma 4. If $A \in \mathbb{R}^{p \times q}$ is generic, then the $n$-gram matrix $A^{\odot n} \in \mathbb{R}^{p^n \times q}$ satisfies the following
property with Lebesgue measure one. For any vector $v \in \mathbb{R}^q$ with $\|v\|_0 \ge 2$, we have

$$\big\|A^{\odot n}_{\mathrm{Rest.}} v\big\|_0 > \big|N_{A^{\odot n}_{\mathrm{Rest.}}}(\mathrm{Supp}(v))\big| - |\mathrm{Supp}(v)|,$$

where for a set $S \subseteq [q]$, $N_{A^{\odot n}}(S) := \{i \in [p]^n : A^{\odot n}(i, j) \ne 0 \text{ for some } j \in S\}$.

Here, we prove the result for the case of $n = 2$. The proof can be easily generalized to larger
$n$.


Let $A := M + Z$ be generic, where $M$ is an arbitrary matrix, perturbed by random continuous
perturbations $Z$. Consider the 2-gram matrix $B := A \odot A \in \mathbb{R}^{p^2 \times q}$. It is shown that the restricted
version of $B$, denoted by $\tilde{B} := B_{\mathrm{Rest.}} \in \mathbb{R}^{\frac{p(p+1)}{2} \times q}$, satisfies the above genericity condition. We first
establish some definitions.

Definition  10.  We  call  a  vector  fully  dense  if  all  of  its  entries  are  non-zero. 

Definition  11.  We  say  a  matrix  has  the  Null  Space  Property  (NSP)  if  its  null  space  does  not 
contain  any  fully  dense  vector. 

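Claim 1. Fix any $S \subseteq [q]$ with $|S| \ge 2$, and set $R := N_{M^{(2\text{-gram})}_{\mathrm{Rest.}}}(S)$. Let $\tilde{C}$ be a $|S| \times |S|$ submatrix
of $\tilde{B}_{R,S}$. Then $\Pr(\tilde{C} \text{ has the NSP}) = 1$.

Proof of Claim 1: First, note that $\tilde{B}$ can be expanded as

$$\tilde{B} := (A \odot A)_{\mathrm{Rest.}} = (M \odot M)_{\mathrm{Rest.}} + \underbrace{(M \odot Z + Z \odot M)_{\mathrm{Rest.}} + (Z \odot Z)_{\mathrm{Rest.}}}_{:= U}.$$

Let $s = |S|$ and let $\tilde{C} = [\tilde{c}_1 | \tilde{c}_2 | \cdots | \tilde{c}_s]^\top$, where $\tilde{c}_i^\top$ is the $i$-th row of $\tilde{C}$. Also, let $C := [c_1 | c_2 | \cdots | c_s]^\top$
and $W := [w_1 | w_2 | \cdots | w_s]^\top$ be the corresponding $|S| \times |S|$ submatrices of $M^{(2\text{-gram})}_{\mathrm{Rest.}}$ and $U$,
respectively. For each $i \in [s]$, denote by $\mathcal{N}_i$ the null space of the matrix $\tilde{C}_i = [\tilde{c}_1 | \tilde{c}_2 | \cdots | \tilde{c}_i]^\top$. Finally, let
$\mathcal{N}_0 = \mathbb{R}^s$. Then, $\mathcal{N}_0 \supseteq \mathcal{N}_1 \supseteq \cdots \supseteq \mathcal{N}_s$. We need to show that, with probability one, $\mathcal{N}_s$ does not
contain any fully dense vector.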

If  one  of  Mi,  i  £  [s],  does  not  contain  any  full  dense  vector,  the  result  is  proved.  Suppose  that 
Mi  contains  some  fully  dense  vector  v.  Since  C  is  a  submatrix  of  M^5sram\  every  row  cj+1  of  C 
contains  at  least  one  non-zero  entry.  Therefore, 

vTci+i  =  Y  v(j)di+1(j) 

fe[s] 

=  Y  w0‘)  (c*+i  a ) + ^*+1  (i) ) , 

je[s]:ci+i(j)^0 


where  (tCi+i(j)  :  j  £  [s]  s.t.  Ci+\(j)  /  0}  are  independent  random  variables,  and  moreover,  they  are 
independent  of  c±, ...  ,di  and  thus  of  v.  By  assumption  on  the  distribution  of  the  Wi+\(j), 


Pr 

v  £  Mi+i 

l  _ 

CM 

iO 

HO 

=  Pr 

Y  u0')  ( Ci+1  (j )  +  Wi+i  CO )  =  0 

Cl,C2,  •  •  •  ,Ci 

- 
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Consequently, 


Pr 


dim(A/i+i)  <  dim  (A/)) 


Cl ,  C2 ,  .  .  ■  ,  Ci 


=  1 


(30) 


for  all  i  =  0, . . . ,  s  —  1.  As  a  result,  with  probability  one,  dim(.A/"s)  =  0.  □ 

Now, we are ready to prove Lemma 4.

Proof of Lemma 4: It follows from Claim 1 that, with probability one, the following event holds: for every $S \subseteq [q]$ with $|S| \ge 2$, and every $|S| \times |S|$ submatrix $\tilde{C}$ of $\tilde{B}_{R,S}$, where $R := N_{M^{(2\text{-gram})}_{\mathrm{Rest.}}}(S)$, $\tilde{C}$ has the NSP.

Now fix $v \in \mathbb{R}^q$ with $\|v\|_0 \ge 2$. Let $S := \mathrm{Supp}(v)$ and $H := \tilde{B}_{R,S}$. Furthermore, let $u \in (\mathbb{R} \setminus \{0\})^{|S|}$ be the restriction of vector $v$ to $S$; observe that $u$ is fully dense. It is clear that $\|\tilde{B} v\|_0 = \|H u\|_0$, so we need to show that

$$\|H u\|_0 > |R| - |S|. \qquad (31)$$

For the sake of contradiction, suppose that $Hu$ has at most $|R| - |S|$ non-zero entries. Since $Hu \in \mathbb{R}^{|R|}$, there is a subset of $|S|$ entries on which $Hu$ is zero. This corresponds to a $|S| \times |S|$ submatrix of $H := \tilde{B}_{R,S}$ which contains $u$ in its null space. It means that this submatrix does not have the NSP, which is a contradiction. Therefore we conclude that $Hu$ must have more than $|R| - |S|$ non-zero entries, which finishes the proof. □
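The property in Lemma 4 can be probed numerically for n = 2. The following is a minimal sketch, not part of the original proof: it assumes the restricted 2-gram matrix keeps one row per unordered pair (i, j) with i <= j, draws the non-zero entries of A from a continuous distribution to mimic genericity, and uses a hypothetical tolerance `tol` to decide which entries count as zero.

```python
import numpy as np

def restricted_2gram(A):
    """Restricted 2-gram matrix: one row A[i] * A[j] per pair i <= j (assumed convention)."""
    p, _ = A.shape
    return np.array([A[i] * A[j] for i in range(p) for j in range(i, p)])

def lemma4_holds(A, v, tol=1e-12):
    """Check ||B v||_0 > |N_B(Supp(v))| - |Supp(v)| for B = restricted_2gram(A)."""
    B = restricted_2gram(A)
    S = np.flatnonzero(v)
    # Rows of B that touch at least one column in the support of v.
    neighbors = np.flatnonzero(np.any(B[:, S] != 0, axis=1))
    nnz = int(np.sum(np.abs(B @ v) > tol))
    return nnz > len(neighbors) - len(S)

rng = np.random.default_rng(0)
p, q = 6, 9  # overcomplete: more columns (topics) than rows (words)
support = rng.random((p, q)) < 0.5
A = np.where(support, rng.uniform(1.0, 2.0, size=(p, q)), 0.0)

v = np.zeros(q)
v[[1, 4, 7]] = rng.normal(size=3)  # a vector with ||v||_0 = 3
print(lemma4_holds(A, v))
```

Because the statement holds only with Lebesgue measure one, a structured (non-generic) choice of non-zero entries could make the check fail; the random draw above is what stands in for genericity.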


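A.2 Proof of moment characterization lemmata

Remark 12. In Lemmata 1 and 2, a specific case of order and persistence (m = rn) was considered. Here, we provide the moment form for a more general case. Assume that $m = rn + s$ for some integers $r \ge 1$, $1 \le s \le \frac{n}{2}$. Then

$$M_{2m}(x) = \Big( \underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r \text{ times}} \otimes A^{(s\text{-gram})} \Big)\, M_{2r}(h)\, \Big( A^{((n-s)\text{-gram})} \otimes \underbrace{A^{\odot n} \otimes \cdots \otimes A^{\odot n}}_{r-1 \text{ times}} \otimes A^{(2s\text{-gram})} \Big)^\top,$$

where $M_{2r}(h) \in \mathbb{R}^{q^{r+1} \times q^{r+1}}$ is the hidden moment defined as

$$M_{2r}(h)_{((i_1, \ldots, i_{r+1}), (j_1, \ldots, j_{r+1}))} := \begin{cases} \mathbb{E}\big[ h_{i_1} \cdots h_{i_r} h_{i_{r+1}} h_{j_2} \cdots h_{j_{r+1}} \big] & \text{if } i_{r+1} = j_1, \\ 0 & \text{o.w.} \end{cases}$$

The tensor form is also characterized as

$$T_{2m}(x) = S_r(\underbrace{A, A, \ldots, A}_{2m \text{ times}}),$$

where $S_r \in \bigotimes^{2m} \mathbb{R}^q$ is the core tensor in the above Tucker representation with the sparsity pattern as follows. Let $\mathbf{i} := (i_1, i_2, \ldots, i_{2m})$. If

$$i_1 = i_2 = \cdots = i_n,\quad i_{n+1} = i_{n+2} = \cdots = i_{2n},\quad \ldots,\quad i_{(2r-1)n+1} = i_{(2r-1)n+2} = \cdots = i_{2rn},\quad i_{2(m-s)+1} = i_{2(m-s)+2} = \cdots = i_{2m},$$

we have

$$S_r(\mathbf{i}) = M_{2r}(h)_{((i_n, i_{2n}, \ldots, i_{(r+1)n}),\, (i_{(r+1)n}, i_{(r+2)n}, \ldots, i_{2rn}, i_{2m}))}.$$

Otherwise, $S_r(\mathbf{i}) = 0$.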


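Proof of Lemma 1: In order to simplify the notation, similar to tensor powers for vectors, the tensor power for a matrix $U \in \mathbb{R}^{p \times q}$ is defined as

$$U^{\otimes r} := \underbrace{U \otimes U \otimes \cdots \otimes U}_{r \text{ times}} \in \mathbb{R}^{p^r \times q^r}. \qquad (32)$$

First, consider the case $m = rn$ for some integer $r \ge 1$. One advantage of encoding $y_j$, $j \in [2r]$, by basis vectors appears in characterizing the conditional moments. The first order conditional moment of words $x_l$, $l \in [2m]$, in the n-persistent topic model can be written as

$$\mathbb{E}[x_{(j-1)n+k} \mid y_j] = A y_j, \quad j \in [2r],\ k \in [n],$$

where $A = [a_1 | a_2 | \cdots | a_q] \in \mathbb{R}^{p \times q}$. Next, the m-th order conditional moment of different views $x_l$, $l \in [m]$, in the n-persistent topic model can be written as

$$\mathbb{E}[x_1 \otimes x_2 \otimes \cdots \otimes x_m \mid y_1 = e_{i_1}, y_2 = e_{i_2}, \ldots, y_r = e_{i_r}] = a_{i_1}^{\otimes n} \otimes a_{i_2}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n},$$

which is derived from the conditional independence relationships among the observations $x_l$, $l \in [m]$, given topics $y_j$, $j \in [r]$. Similar to the first order moments, since vectors $y_j$, $j \in [r]$, are encoded by the basis vectors $e_i \in \mathbb{R}^q$, the above moment can be written as the following matrix multiplication

$$\mathbb{E}[x_1 \otimes x_2 \otimes \cdots \otimes x_m \mid y_1, y_2, \ldots, y_r] = \big( A^{\odot n} \big)^{\otimes r} (y_1 \otimes y_2 \otimes \cdots \otimes y_r), \qquad (33)$$

where the $(\cdot)^{\otimes r}$ notation is defined in equation (32). Now for the (2m)-th order moment, we have

$$\begin{aligned}
M_{2m}(x) &:= \mathbb{E}\big[ (x_1 \otimes x_2 \otimes \cdots \otimes x_m)(x_{m+1} \otimes x_{m+2} \otimes \cdots \otimes x_{2m})^\top \big] \\
&= \mathbb{E}_{(y_1, \ldots, y_{2r})}\Big[ \mathbb{E}\big[ (x_1 \otimes \cdots \otimes x_m)(x_{m+1} \otimes \cdots \otimes x_{2m})^\top \mid y_1, y_2, \ldots, y_{2r} \big] \Big] \\
&\overset{(a)}{=} \mathbb{E}_{(y_1, \ldots, y_{2r})}\Big[ \mathbb{E}\big[ (x_1 \otimes \cdots \otimes x_m) \mid y_1, \ldots, y_{2r} \big]\, \mathbb{E}\big[ (x_{m+1} \otimes \cdots \otimes x_{2m})^\top \mid y_1, \ldots, y_{2r} \big] \Big] \\
&\overset{(b)}{=} \mathbb{E}_{(y_1, \ldots, y_{2r})}\Big[ \mathbb{E}\big[ (x_1 \otimes \cdots \otimes x_m) \mid y_1, \ldots, y_r \big]\, \mathbb{E}\big[ (x_{m+1} \otimes \cdots \otimes x_{2m})^\top \mid y_{r+1}, \ldots, y_{2r} \big] \Big] \\
&\overset{(c)}{=} \mathbb{E}_{(y_1, \ldots, y_{2r})}\Big[ \big( A^{\odot n} \big)^{\otimes r} (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \Big( \big( A^{\odot n} \big)^{\otimes r} \Big)^\top \Big] \\
&= \big( A^{\odot n} \big)^{\otimes r}\, \mathbb{E}\big[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \big]\, \Big( \big( A^{\odot n} \big)^{\otimes r} \Big)^\top \\
&\overset{(d)}{=} \big( A^{\odot n} \big)^{\otimes r}\, M_{2r}(y)\, \Big( \big( A^{\odot n} \big)^{\otimes r} \Big)^\top, \qquad (34)
\end{aligned}$$

where (a) results from the independence of $(x_1, \ldots, x_m)$ and $(x_{m+1}, \ldots, x_{2m})$ given $(y_1, y_2, \ldots, y_{2r})$, and (b) is concluded from the independence of $(x_1, \ldots, x_m)$ and $(y_{r+1}, \ldots, y_{2r})$ given $(y_1, \ldots, y_r)$ and the independence of $(x_{m+1}, \ldots, x_{2m})$ and $(y_1, \ldots, y_r)$ given $(y_{r+1}, \ldots, y_{2r})$. Equation (33) is used in (c) and finally, the (2r)-th order moment of $(y_1, \ldots, y_{2r})$ is defined as $M_{2r}(y) := \mathbb{E}\big[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \big]$ in (d).

For $M_{2r}(y)$, we have by the law of total expectation

$$\begin{aligned}
M_{2r}(y) &:= \mathbb{E}\big[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \big] \\
&= \mathbb{E}_h\Big[ \mathbb{E}\big[ (y_1 \otimes \cdots \otimes y_r)(y_{r+1} \otimes \cdots \otimes y_{2r})^\top \mid h \big] \Big] \\
&= \mathbb{E}_h\Big[ (\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}})(\underbrace{h \otimes \cdots \otimes h}_{r \text{ times}})^\top \Big] \\
&= M_{2r}(h),
\end{aligned}$$

where the third equality is concluded from the conditional independence of variables $y_j$, $j \in [2r]$, given $h$ and the model assumption that $\mathbb{E}[y_j \mid h] = h$, $j \in [2r]$. Substituting this in equation (34) finishes the proof for the n-persistent topic model. Similarly, the moment of the single topic model (infinite persistence) can also be derived. □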
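The matrix form (33) rests on the mixed-product property of Kronecker products. The snippet below is an illustrative sketch of that identity, under the assumption that the n-gram matrix is the column-wise Kronecker (Khatri-Rao) power; the helper names `khatri_rao_power` and `kron_power`, the dimensions, and the topic indices are hypothetical choices made only for this check.

```python
import numpy as np
from functools import reduce

def khatri_rao_power(A, n):
    """n-gram matrix: column j is the n-fold Kronecker power of A[:, j]."""
    cols = [reduce(np.kron, [A[:, j]] * n) for j in range(A.shape[1])]
    return np.stack(cols, axis=1)

def kron_power(U, r):
    """r-fold Kronecker (tensor) power of a matrix, as in equation (32)."""
    return reduce(np.kron, [U] * r)

rng = np.random.default_rng(1)
p, q, n, r = 3, 4, 2, 2
A = rng.random((p, q))

topics = (1, 3)  # hypothetical topic indices (i_1, i_2)
basis = np.eye(q)
y = reduce(np.kron, [basis[i] for i in topics])  # y_1 (x) y_2 with y_k = e_{i_k}

# Left side of (33): (A^{odot n})^{otimes r} (y_1 otimes ... otimes y_r).
lhs = kron_power(khatri_rao_power(A, n), r) @ y
# Right side: a_{i_1}^{otimes n} otimes ... otimes a_{i_r}^{otimes n}.
rhs = reduce(np.kron, [reduce(np.kron, [A[:, i]] * n) for i in topics])

identity_holds = bool(np.allclose(lhs, rhs))
print(identity_holds)
```

Taking the expectation over the topics then moves the constant matrix outside, which is the step between (c) and (d) in the derivation.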

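Proof of Lemma 2: Defining $\Lambda := M_{2r}(h) \in \mathbb{R}^{q^r \times q^r}$ and $B := [A^{\odot n}]^{\otimes r} \in \mathbb{R}^{p^{rn} \times q^r}$, the (2rn)-th order moment $M_{2rn}(x) \in \mathbb{R}^{p^{rn} \times p^{rn}}$ of the n-persistent topic model, characterized in Lemma 1, can be written as

$$M_{2rn}(x) = B \Lambda B^\top.$$

Let $b_{(i_1, \ldots, i_r)} \in \mathbb{R}^{p^{rn}}$ denote the corresponding column of $B$ indexed by r-tuple $(i_1, \ldots, i_r)$, $i_k \in [q]$, $k \in [r]$. Then, the above matrix equation can be expanded as

$$\begin{aligned}
M_{2rn}(x) &= \sum_{\substack{i_1, \ldots, i_r \in [q] \\ j_1, \ldots, j_r \in [q]}} \Lambda\big( (i_1, \ldots, i_r), (j_1, \ldots, j_r) \big)\, b_{(i_1, \ldots, i_r)}\, b_{(j_1, \ldots, j_r)}^\top \\
&= \sum_{\substack{i_1, \ldots, i_r \in [q] \\ j_1, \ldots, j_r \in [q]}} \Lambda\big( (i_1, \ldots, i_r), (j_1, \ldots, j_r) \big) \big[ a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n} \big] \big[ a_{j_1}^{\otimes n} \otimes \cdots \otimes a_{j_r}^{\otimes n} \big]^\top,
\end{aligned}$$

where relation $b_{(i_1, \ldots, i_r)} = a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n}$, $i_1, \ldots, i_r \in [q]$, is used in the last equality. Let $m_{2rn}(x) \in \mathbb{R}^{p^{2rn}}$ denote the vectorized form of the (2rn)-th order moment $M_{2rn}(x) \in \mathbb{R}^{p^{rn} \times p^{rn}}$. Therefore, we have

$$m_{2rn}(x) := \mathrm{vec}\big( M_{2rn}(x) \big) = \sum_{\substack{i_1, \ldots, i_r \in [q] \\ j_1, \ldots, j_r \in [q]}} \Lambda\big( (i_1, \ldots, i_r), (j_1, \ldots, j_r) \big)\, a_{i_1}^{\otimes n} \otimes \cdots \otimes a_{i_r}^{\otimes n} \otimes a_{j_1}^{\otimes n} \otimes \cdots \otimes a_{j_r}^{\otimes n},$$

where the ordering of the two index tuples is immaterial because $\Lambda$ is symmetric in them. Then, we have the following equivalent tensor form for the original model:

$$T_{2rn}(x) := \mathrm{ten}\big( m_{2rn}(x) \big) = \sum_{\substack{i_1, \ldots, i_r \in [q] \\ j_1, \ldots, j_r \in [q]}} \Lambda\big( (i_1, \ldots, i_r), (j_1, \ldots, j_r) \big)\, a_{i_1}^{\circ n} \circ \cdots \circ a_{i_r}^{\circ n} \circ a_{j_1}^{\circ n} \circ \cdots \circ a_{j_r}^{\circ n}.$$

□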

A.3 Sufficient matching properties for satisfying rank and graph expansion conditions

In the following lemma, it is shown that under a perfect n-gram matching and additional genericity and krank conditions, the rank and graph expansion conditions 6 and 7 on $A^{\odot n}$ are satisfied.

Lemma 5. Assume that the bipartite graph $G(V_h, V_o; A)$ has a perfect n-gram matching (condition 2 is satisfied). Then, the following results hold for the n-gram matrix $A^{\odot n}$:

1) If $A$ is generic, $A^{\odot n}$ is full column rank (condition 6) with Lebesgue measure one (almost surely).

2) If krank condition 3 holds, $A^{\odot n}$ satisfies the proposed expansion property in condition 7.

Proof: Let $M$ denote the perfect n-gram matching of the bipartite graph $G(V_h, V_o; A)$. From Lemma 3, there exists a perfect matching $M^{\odot n}$ for the bipartite graph $G(V_h, V_o^{(n)}; A^{\odot n})$. Denote the corresponding bi-adjacency matrix to the edge set $M$ as $A_M$. Similarly, $B_M$ denotes the corresponding bi-adjacency matrix to the edge set $M^{\odot n}$. Note that $\mathrm{Supp}(A_M) \subseteq \mathrm{Supp}(A)$ and $\mathrm{Supp}(B_M) \subseteq \mathrm{Supp}(A^{\odot n})$.

Since $B_M$ is a perfect matching, it contains $q := |V_h|$ rows, each of which has only one non-zero entry, and furthermore, the non-zero entries are in $q$ different columns. Therefore, these rows form $q$ linearly independent vectors. Since the row rank and column rank of a matrix are equal, and the number of columns of $B_M$ is $q$, the column rank of $B_M$ is $q$, or in other words, $B_M$ is full column rank. Since $A$ is generic, from Lemma 6 (with a slight modification in the analysis^{15}), $A^{\odot n}$ is also full column rank with Lebesgue measure one (almost surely). This completes the proof of part 1.

15 The result of Lemma 6 is about the column rank of $A$ itself, but here it is about the column rank of $A^{\odot n}$, for which the same analysis works. Note that the support of $B_M$ (which is full column rank here) is within the support of $A^{\odot n}$ and therefore Lemma 6 can still be applied.

Next, the second part is proved. From the krank definition, we have

$$|N_A(S')| \ge |S'| \quad \text{for } S' \subseteq V_h,\ |S'| \le \mathrm{krank}(A),$$

which is concluded from the fact that the corresponding submatrix of $A$ specified by $S'$ should be full column rank. From this inequality, we have

$$|N_A(S')| \ge \mathrm{krank}(A) \quad \text{for } S' \subseteq V_h,\ |S'| = \mathrm{krank}(A). \qquad (35)$$

Then, we have

$$\begin{aligned}
|N_A(S)| &\ge |N_A(S')| \quad \text{for } S' \subset S \subseteq V_h,\ |S| > \mathrm{krank}(A),\ |S'| = \mathrm{krank}(A), \\
&\ge \mathrm{krank}(A) \\
&\ge d_{\max}(A)^n, \qquad (36)
\end{aligned}$$

where (35) is used in the second inequality and the last inequality is from krank condition 3.

In the restricted n-gram matrix $A^{\odot n}_{\mathrm{Rest.}}$, the number of neighbors for a set $S \subseteq V_h$, $|S| > \mathrm{krank}(A)$, can be bounded as

$$\big| N_{A^{\odot n}_{\mathrm{Rest.}}}(S) \big| \ge |N_A(S)| + |S| \ge d_{\max}(A)^n + |S| \quad \text{for } |S| > \mathrm{krank}(A),$$

where the first inequality is due to the fact that the set $N_{A^{\odot n}_{\mathrm{Rest.}}}(S)$ consists of rows indexed by the following two subsets: n-tuples $(i, i, \ldots, i)$ where all the indices are equal, and n-tuples $(i_1, \ldots, i_n)$ with distinct indices, i.e., $i_1 \neq i_2 \neq \cdots \neq i_n$. The former subset is exactly $N_A(S)$, while the size of the latter subset is at least $|S|$ due to the existence of a perfect n-gram matching in $A$. The bound (36) is used in the second inequality. Since $d_{\max}(A^{\odot n}) = d_{\max}(A)^n$, the proof of part 2 is also completed.

□
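The last step of the proof uses $d_{\max}(A^{\odot n}) = d_{\max}(A)^n$. A small sketch of why this holds is given below, assuming that $d_{\max}$ denotes the maximum number of non-zero entries in a column and that the n-gram matrix is the unrestricted column-wise Kronecker power; both readings are assumptions made for this illustration, and the matrix is a toy example.

```python
import numpy as np
from functools import reduce

def khatri_rao_power(A, n):
    """Unrestricted n-gram matrix: column j is the n-fold Kronecker power of A[:, j]."""
    cols = [reduce(np.kron, [A[:, j]] * n) for j in range(A.shape[1])]
    return np.stack(cols, axis=1)

def max_column_degree(A):
    """Assumed meaning of d_max: the largest number of non-zeros in any column."""
    return int(np.max(np.count_nonzero(A, axis=0)))

A = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 0.0],
              [0.0, 4.0, 5.0]])
n = 3

d_max = max_column_degree(A)                         # 2 for this toy matrix
d_max_ngram = max_column_degree(khatri_rao_power(A, n))
print(d_max, d_max_ngram)                            # prints: 2 8
```

Each entry of a Kronecker power is a product of n entries of the original column, so it is non-zero exactly when all n factors are non-zero, which gives the n-th power of the column degree.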

Remark 13. The second result of the above lemma is similar to the necessity argument of (Hall's) Theorem 6 for the existence of a perfect matching in a bipartite graph, but generalized to the case of perfect n-gram matching and with an additional krank condition.


A.4 (Auxiliary) lemma

Proof of Lemma 3: We show that if $G(Y, X; A)$ has a perfect n-gram matching, then $G(Y, X^{(n)}; A^{\odot n})$ has a perfect matching. The reverse can also be immediately shown by reversing the discussion and exploiting the additional condition stated in the lemma.

Let $E^{\odot n}$ denote the edge set of the bipartite graph $G(Y, X^{(n)}; A^{\odot n})$. Assume $G(Y, X; A)$ has a perfect n-gram matching $M \subseteq E$. For any $j \in Y$, let $N_M(j)$ denote the set of neighbors of vertex $j$ according to edge set $M$. Since $M$ is a perfect n-gram matching, $|N_M(j)| = n$ for all $j \in Y$. It can be immediately concluded from Definition 3 that the sets $N_M(j)$ are all distinct, i.e., $N_M(j_1) \neq N_M(j_2)$ for any $j_1, j_2 \in Y$, $j_1 \neq j_2$. For any $j \in Y$, let $N'_M(j)$ denote an arbitrary ordered n-tuple generated from the elements of set $N_M(j)$. From the definition of the n-gram matrix, we have $A^{\odot n}(N'_M(j), j) \neq 0$ for all $j \in Y$. Hence, $(j, N'_M(j)) \in E^{\odot n}$ for all $j \in Y$, which together with the fact that all tuples $N'_M(j)$ are distinct implies that $M^{\odot n} := \{ (j, N'_M(j)) \mid j \in Y \} \subseteq E^{\odot n}$ is a perfect matching for $G(Y, X^{(n)}; A^{\odot n})$. □
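The construction in this proof is simple enough to write down directly. The sketch below is illustrative only: it assumes a perfect n-gram matching is given as a mapping from each node j in Y to its n matched neighbors, and it relies on the two properties used above (exactly n neighbors per node, and distinct neighbor sets). The function name and the toy example are hypothetical.

```python
from itertools import combinations

def lift_to_perfect_matching(ngram_matching, n):
    """Map each node j to one ordered n-tuple N'_M(j) built from its matched neighbors."""
    lifted = {}
    for j, neighbors in ngram_matching.items():
        if len(set(neighbors)) != n:
            raise ValueError(f"node {j} must have exactly {n} matched neighbors")
        lifted[j] = tuple(sorted(set(neighbors)))  # one arbitrary (here: sorted) ordering
    if len(set(lifted.values())) != len(lifted):
        raise ValueError("two nodes share the same neighbor set")
    return lifted

# Toy overcomplete example: p = 4 words, q = 6 topics, each topic matched to a distinct pair.
M = {j: set(pair) for j, pair in enumerate(combinations(range(4), 2))}
lifted = lift_to_perfect_matching(M, n=2)
print(lifted)
```

Each lifted tuple indexes one row of the n-gram matrix, and since the tuples are distinct, every node in Y is matched to its own row; this is how a graph with more topics than words can still admit a perfect matching at the n-gram level.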
Lemma 6. Consider a matrix $C \in \mathbb{R}^{m \times r}$ which is generic. Let $\tilde{C} \in \mathbb{R}^{m \times r}$ be such that $\mathrm{Supp}(\tilde{C}) \subseteq \mathrm{Supp}(C)$ and the non-zero entries of $\tilde{C}$ are the same as the corresponding non-zero entries of $C$. If $\tilde{C}$ is full column rank, then $C$ is also full column rank, almost surely.

Proof: Since $\tilde{C}$ is full column rank, there exists an $r \times r$ submatrix of $\tilde{C}$, denoted by $\tilde{C}_S$, with non-zero determinant, i.e., $\det(\tilde{C}_S) \neq 0$. Let $C_S$ denote the corresponding submatrix of $C$ indexed by the same rows and columns as $\tilde{C}_S$.

The determinant of $C_S$ is a polynomial in the entries of $C_S$. Since $\tilde{C}_S$ can be derived from $C_S$ by keeping the corresponding non-zero entries, $\det(C_S)$ can be decomposed into two terms as

$$\det(C_S) = \det(\tilde{C}_S) + f(C_S),$$

where the first term corresponds to the monomials for which all the variables (entries of $C_S$) are also in $\tilde{C}_S$, and the second term corresponds to the monomials for which at least one variable is not in $\tilde{C}_S$. The first term is non-zero as stated earlier. Since $C$ is generic, the polynomial $f(C_S)$ is non-trivial and therefore its roots have Lebesgue measure zero. It implies that $\det(C_S) \neq 0$ with Lebesgue measure one (almost surely), and hence, $C_S$ is full (column) rank. Thus, $C$ is also full column rank, almost surely. □
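Lemma 6 can be illustrated numerically. The following is a rough sketch under stated assumptions: the sparser matrix (called `C_tilde` here) is a matching-like pattern that is full column rank by construction, and the generic matrix `C` keeps those entries and adds further non-zero entries drawn from a continuous distribution. The dimensions, the density 0.5, and the seed are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
m, r = 5, 3

# Sparser matrix: one non-zero per column, in distinct rows, so it has rank r.
C_tilde = np.zeros((m, r))
C_tilde[[0, 2, 4], [0, 1, 2]] = rng.uniform(1.0, 2.0, size=3)

# Generic matrix: same entries on Supp(C_tilde), plus extra random non-zeros elsewhere.
extra_support = rng.random((m, r)) < 0.5
extra_values = rng.uniform(1.0, 2.0, size=(m, r))
C = np.where(C_tilde != 0, C_tilde, np.where(extra_support, extra_values, 0.0))

rank_tilde = np.linalg.matrix_rank(C_tilde)
rank_generic = np.linalg.matrix_rank(C)
print(rank_tilde, rank_generic)
```

The extra entries can only destroy full column rank on a measure-zero set of values, which is the content of the lemma; a deliberately chosen (non-generic) set of extra values could still make `C` rank deficient.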

Finally, Theorem 1 is proved by combining the results of Theorem 5 and Lemma 5.

Proof of Theorem 1: Since conditions 2 and 3 hold and $A$ is generic, Lemma 5 can be applied, which implies that rank condition 6 is satisfied almost surely and expansion condition 7 also holds. Therefore, all the required conditions for Theorem 5 are satisfied almost surely, and this completes the proof. □


B Proof of Random Identifiability Result (Theorem 2)

We  provide  detailed  proof  of  the  steps  stated  in  the  proof  sketch  of  random  result  in  Section 

ES 

B.1 Proof of existence of perfect n-gram matching and Kruskal results

Proof of Theorem 3: Vertex sets $X$ and $Y$ are partitioned as follows (see Figure 6). Define $J := c\frac{p}{n}$. Partition set $X$ uniformly at random into $n$ sets of (almost) equal size^{16}, denoted by $X'_l$, $l \in [n]$. Define sets $X_l := \cup_{i=1}^{l} X'_i$, $l \in [n]$. Furthermore, partition set $Y$ uniformly at random, hierarchically as follows. First, partition it into $J$ sets, each with size at most $(c\frac{p}{n})^{n-1}$, and denote them by $Y_i$, $i \in [J]$. Next, partition each of these new smaller sets $Y_i$ further into $J$ sets, each with size at most $(c\frac{p}{n})^{n-2}$. Do this iteratively for $n-1$ steps, where at the end, set $Y$ is partitioned into sets with size at most $c\frac{p}{n}$. The first two steps are shown in Figure 6.

16 By almost, we mean that the maximum difference in the size of partitions is 1, which is always possible.

Figure 6: Partitioning of sets $Y$ and $X$, proposed in the proof of Theorem 3. Set $X$ is randomly (uniformly) partitioned into $n$ sets of (almost) equal size, denoted by $X'_l$, $l \in [n]$. Set $Y$ is also randomly partitioned in a recursive manner. In each step, it is partitioned into $J = c\frac{p}{n} = O(p)$ sets. These smaller sets are again partitioned, recursively. This partitioning process is performed until reaching sets with size $O(p)$. The first two steps are shown in this figure.

Proof by induction: The existence of a perfect n-gram matching from set $Y$ to set $X$ is proved by an induction argument. Consider one of the intermediate sets in the hierarchical partitioning of $Y$ with size $O(p^l)$ and its further partitioning into $J := c\frac{p}{n}$ sets, each with size $O(p^{l-1})$, for any $l \in \{2, \ldots, n\}$. In the induction step, it is shown that if there exists a perfect $(l-1)$-gram matching from each of these subsets of $Y$ with size $O(p^{l-1})$ to $X_{l-1}$, then there exists a perfect $l$-gram matching from the original set with size $O(p^l)$ to set $X_l$. Specifically, in the last induction step, it is shown that if there exists a perfect $(n-1)$-gram matching from each set $Y_l$, $l \in [J]$, to set $X_{n-1}$, then there exists a perfect n-gram matching from $Y$ to $X_n = X$.


Base case: The base case of the induction argument holds as follows. By applying Lemma 8 and Lemma 7, there exists a perfect matching from each partition in $Y$ with size at most $c\frac{p}{n} = O(p)$ to set $X_1$, whp.


Induction step: Consider $J$ different bipartite graphs $G_i(Y_i, X_{n-1}; E_i)$, $i \in [J]$, by considering sets $Y_i$ and $X_{n-1}$ and the corresponding subset of edges $E_i \subseteq E$ incident to them. See Figure 7a. The induction step is to show that if each of the corresponding $J$ bipartite graphs $G_i(Y_i, X_{n-1}; E_i)$, $i \in [J]$, has a perfect $(n-1)$-gram matching, then whp, the original bipartite graph $G(Y, X; E)$ has a perfect n-gram matching.

Let us denote the corresponding perfect $(n-1)$-gram matching of $G_i(Y_i, X_{n-1}; E_i)$ by $M_i$. Furthermore, the set of all subsets of $X_{n-1}$ with cardinality $n-1$ is denoted by $P_{n-1}(X_{n-1})$, i.e., $P_{n-1}(X_{n-1})$ includes the sets with $(n-1)$ elements in the power set^{17} of $X_{n-1}$. For each set $S \in P_{n-1}(X_{n-1})$, take the set of all nodes in $Y$ which are connected to all members of $S$ according to the union of matchings $\cup_{i=1}^{J} M_i$. Call this set the parents of $S$, denoted by $\mathrm{Pa}(S)$. According to the definition of perfect $(n-1)$-gram matching, there is at most one node in each set $Y_i$ which is connected to all members of $S$ through the matching $M_i$, and therefore $|\mathrm{Pa}(S)| \le J = c\frac{p}{n}$. In addition, note that the sets $\mathrm{Pa}(S)$ impose a partitioning on set $Y$, i.e., each node $j \in Y$ is included in exactly one set $\mathrm{Pa}(S)$ for some $S \in P_{n-1}(X_{n-1})$. This is because of the perfect $(n-1)$-gram matchings considered for sets $Y_i$, $i \in [J]$.

17 The power set of any set $S$ is the set of all subsets of $S$.

Figure 7: Auxiliary figures for the proof of the induction step. (a) Partitioning of sets $Y$ and $X$ proposed in the proof, where set $Y$ is partitioned into $J := c\frac{p}{n}$ partitions $Y_1, \ldots, Y_J$ with (almost) equal size, for some constant $c < 1$. In addition, set $X$ is partitioned into two partitions $X_{n-1}$ and $X'_n$ with sizes $|X_{n-1}| = \frac{n-1}{n} p$ and $|X'_n| = \frac{p}{n}$. The perfect $(n-1)$-gram matchings $M_i$, $i \in [J]$, through bipartite graphs $G_i(Y_i, X_{n-1}; E_i)$, $i \in [J]$, are also highlighted in the figure. (b) Set $Y$ is partitioned into subsets $\mathrm{Pa}(S)$, $S \in P_{n-1}(X_{n-1})$, which are generated through the perfect $(n-1)$-gram matchings $M_i$, $i \in [J]$. $S_1$, $S_2$ and $S_3$ are three different sets in $P_{n-1}(X_{n-1})$ shown as samples. In addition, the perfect matchings from $\mathrm{Pa}(S)$, $S \in P_{n-1}(X_{n-1})$, to $X'_n$, proposed in the proof, are also highlighted in the figure.
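The grouping of nodes into parent sets $\mathrm{Pa}(S)$ can be sketched in a few lines. This is only an illustration of the bookkeeping in the induction step: each matching $M_i$ is assumed to be given as a mapping from the nodes of block $Y_i$ to their matched $(n-1)$-subsets, and the node names and subsets below are made up for the example.

```python
from collections import defaultdict

def parents_by_subset(matchings):
    """Group nodes of Y by their matched (n-1)-subset S, returning S -> Pa(S).

    matchings[i] maps each node of block Y_i to its matched subset of X_{n-1}.
    """
    parents = defaultdict(list)
    for block_id, matching in enumerate(matchings):
        for node, subset in matching.items():
            parents[frozenset(subset)].append((block_id, node))
    return dict(parents)

# Hypothetical example with J = 2 blocks and n - 1 = 2.
matchings = [
    {"a": {0, 1}, "b": {1, 2}},  # M_1 on block Y_1
    {"c": {0, 1}, "d": {0, 2}},  # M_2 on block Y_2
]
pa = parents_by_subset(matchings)
largest = max(len(nodes) for nodes in pa.values())
print(largest)  # prints: 2
```

Within one block the matched subsets are distinct, so each block contributes at most one node to any $\mathrm{Pa}(S)$; this is why the largest group above has size at most the number of blocks $J$, and why the groups partition $Y$.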

Now, a perfect n-gram matching for the original bipartite graph is constructed as follows. For any $S \in P_{n-1}(X_{n-1})$, consider the set of parents $\mathrm{Pa}(S)$. Create the bipartite graph $G_S(\mathrm{Pa}(S), X'_n; E_S)$, where $E_S \subseteq E$ is the subset of edges incident to partitions $\mathrm{Pa}(S) \subseteq Y$ and $X'_n \subseteq X$. Denote by $d_S$ the minimum degree of nodes in set $\mathrm{Pa}(S)$ in the bipartite graph $G_S(\mathrm{Pa}(S), X'_n; E_S)$. Applying Lemma 8, we have

$$\begin{aligned}
\Pr\big[ d_S \ge 1 + \beta \log(p/n) \big] &\ge 1 - J \exp\Big( -\frac{2}{n^2} \frac{\big( d_{\min} - \beta n \log(p/n) \big)^2}{d_{\min}} \Big) \qquad (37) \\
&\ge 1 - \frac{c}{n}\, p^{-\beta \log 1/c} = 1 - O\big( p^{-\beta \log 1/c} \big),
\end{aligned}$$

where $\beta \log 1/c > n-1$, and the last inequality is concluded from the degree bound $d_{\min} \ge \alpha \log p$. Furthermore, we have $|\mathrm{Pa}(S)| \le c\frac{p}{n} = c|X'_n|$. Now, we can apply Lemma 7, concluding that there exists a perfect matching from $\mathrm{Pa}(S)$ to $X'_n$ within the bipartite graph $G_S(\mathrm{Pa}(S), X'_n; E_S)$, with probability at least $1 - O(p^{-\beta \log 1/c})$. Refer to Figure 7b for a schematic picture. The edges of this perfect matching are combined with the corresponding edges of the existing perfect $(n-1)$-gram matchings $M_i$, $i \in [J]$, to provide $n$ incident edges to each node $i \in \mathrm{Pa}(S)$. It is easy to see that this provides a perfect n-gram matching from $\mathrm{Pa}(S)$ to $X$.

We perform the same steps for all sets $S \in P_{n-1}(X_{n-1})$ to obtain a perfect n-gram matching from any $\mathrm{Pa}(S)$, $S \in P_{n-1}(X_{n-1})$, to $X$. Finally, according to this construction, the union of all of these matchings is a perfect n-gram matching from $\cup_{S \in P_{n-1}(X_{n-1})} \mathrm{Pa}(S) = Y$ to $X$. This finishes the proof of the induction step. Note that here we analyzed the last induction step, where the existence of a perfect n-gram matching is concluded from the existence of the corresponding perfect $(n-1)$-gram matchings. The earlier induction steps, where the existence of a perfect $l$-gram matching is concluded from the existence of the corresponding perfect $(l-1)$-gram matchings for any $l \in \{2, \ldots, n\}$, can be proven similarly.


Probability rate: We now provide the probability rate of the above events. Let $N_l^{(hp)}$, $l \in [n]$, denote the total number of times that the perfect matching result of Lemma 7 is used in step $l$ in order to ensure that there exists a perfect $l$-gram matching from the corresponding partitions of $Y$ to set $X_l$, whp. Let $N^{(hp)} = \sum_{l \in [n]} N_l^{(hp)}$. As earlier, let $P_{l-1}(X_{l-1})$ denote the set of all subsets of $X_{l-1}$ with cardinality $l-1$. We have

$$|P_{l-1}(X_{l-1})| = \binom{|X_{l-1}|}{l-1} = \binom{\frac{(l-1)p}{n}}{l-1}.$$

According to the construction method of the $l$-gram matching from $(l-1)$-gram matchings, proposed in the induction step, $|P_{l-1}(X_{l-1})|$ is the number of times Lemma 7 is used in order to ensure that there exists a perfect $l$-gram matching for each partition on the $Y$ side. Since at most $J^{n-l}$ such $l$-gram matchings are proposed in step $l$, the number $N_l^{(hp)}$ can be bounded as

$$N_l^{(hp)} \le J^{n-l}\, |P_{l-1}(X_{l-1})| = J^{n-l} \binom{\frac{(l-1)p}{n}}{l-1}, \quad l \in \{2, \ldots, n\}. \qquad (38)$$

Since in the first step, $N_1^{(hp)} = J^{n-1}$ perfect matchings need to exist in the above discussion, we have

$$\begin{aligned}
N^{(hp)} &= J^{n-1} + \sum_{l=2}^{n} N_l^{(hp)} \\
&\le J^{n-1} + \sum_{l=2}^{n} J^{n-l} \binom{\frac{(l-1)p}{n}}{l-1} \\
&\le \Big( c\frac{p}{n} \Big)^{n-1} + \sum_{l=2}^{n} \Big( c\frac{p}{n} \Big)^{n-l} \Big( e\frac{p}{n} \Big)^{l-1},
\end{aligned}$$

where inequality (38) is used in the first inequality, and $J := c\frac{p}{n}$ and the inequality $\binom{n}{k} \le (e\frac{n}{k})^k$ are exploited in the second inequality.

Since the result of Lemma 7 holds with probability at least $1 - O(p^{-\beta \log 1/c})$ and it is assumed that $\beta \log 1/c > n-1$, by applying the union bound, we have the existence of a perfect n-gram matching with probability at least $1 - O(p^{-\beta'})$, for $\beta' = \beta \log \frac{1}{c} - (n-1) > 0$.

Furthermore, note that the degree concentration bound in (37) is also used $O(p^{n-1})$ times. Since the bound in (37) holds with probability at least $1 - O(p^{-\beta \log 1/c})$ and it is assumed that $\beta \log 1/c > n-1$, this also reduces to the same probability rate.

The coefficient of the above polynomial probability rate can also be computed explicitly: the perfect n-gram matching exists with probability at least $1 - \gamma_1 p^{-\beta'}$, with


71  =  e 


n—l 


n—l 
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i-ffi 


n 


P'+i 
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— (3  log  1/c 

where $\delta_1$ is a constant satisfying $e^2 \big( \frac{p}{n} \big)^{-\beta \log 1/c} < \delta_1 < 1$.


□ 


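Proof of Theorem 4: Let $G(Y, X; A)$ denote the corresponding bipartite graph to matrix $A$, where node sets $Y = [q]$ and $X = [p]$ index the columns and rows of $A$ respectively. Therefore, $|Y| = q$ and $|X| = p$. Fix some $S \subseteq Y$ such that $|S| \le p$. Then

$$\begin{aligned}
\Pr\big( |N(S)| \le |S| \big) &\le \sum_{T \subseteq X :\, |T| = |S|} \Pr\big( N(S) \subseteq T \big) \\
&= \sum_{T \subseteq X :\, |T| = |S|}\ \prod_{i \in S} \binom{|T|}{d_i} \Big/ \binom{p}{d_i} \\
&\le \sum_{T \subseteq X :\, |T| = |S|}\ \prod_{i \in S} \Big( \frac{|S|}{p} \Big)^{d_i} \\
&\le \sum_{T \subseteq X :\, |T| = |S|}\ \prod_{i \in S} \Big( \frac{|S|}{p} \Big)^{d_{\min}} \\
&= \binom{p}{|S|} \Big( \frac{|S|}{p} \Big)^{d_{\min} |S|}, \qquad (39)
\end{aligned}$$

where the bound $\binom{|S|}{d_i} / \binom{p}{d_i} \le \big( \frac{|S|}{p} \big)^{d_i}$ is used in the second inequality, and the last inequality is concluded from the fact that $\frac{|S|}{p} \le 1$.

Let $\mathcal{E}$ denote the event that for any subset $S \subseteq Y$ with $|S| \le r$, we have $|N(S)| \ge |S|$, i.e.,

$$\mathcal{E} := \text{“} \forall S \subseteq Y \wedge 1 \le |S| \le r : |N(S)| \ge |S| \text{”}.$$

Then, by the union bound and inequality (39), we have

$$\begin{aligned}
\Pr(\mathcal{E}^c) = \Pr\big( \exists S \subseteq Y \text{ s.t. } 1 \le |S| \le r \wedge |N(S)| < |S| \big) &\le \sum_{s=1}^{r} \binom{q}{s} \binom{p}{s} \Big( \frac{s}{p} \Big)^{d_{\min} s} \\
&\le \sum_{s=1}^{r} \Big( e\frac{q}{s} \Big)^{s} \Big( e\frac{p}{s} \Big)^{s} \Big( \frac{s}{p} \Big)^{d_{\min} s} \\
&= \sum_{s=1}^{r} \Big( \frac{e^2 q\, s^{d_{\min} - 2}}{p^{d_{\min} - 1}} \Big)^{s},
\end{aligned}$$

where the bound $\binom{n}{k} \le (e\frac{n}{k})^k$ is used in the second inequality. For $r = cp$, the above inequality reduces to

$$\begin{aligned}
\Pr(\mathcal{E}^c) &\le \sum_{s=1}^{r} \Big( e^2 c^{d_{\min} - 2}\, \frac{q}{p} \Big)^{s} \\
&\le \sum_{s=1}^{r} \big( e^2 c'\, c^{d_{\min} - 1} p^{n-1} \big)^{s} \\
&\le \sum_{s=1}^{r} \big( e^2 c'\, c^{\beta \log p}\, p^{n-1} \big)^{s} \\
&= \sum_{s=1}^{r} \big( e^2 c'\, p^{n-1-\beta \log 1/c} \big)^{s} \\
&\le \frac{e^2 c' p^{-\beta'}}{1 - e^2 c' p^{-\beta'}} = O(p^{-\beta'}), \quad \text{for } \beta' = \beta \log \frac{1}{c} - (n-1) > 0,
\end{aligned}$$

where the size condition assumed in the theorem is used in the second inequality with $c' := \frac{1}{c} \big( \frac{c}{n} \big)^n$, and the degree condition is exploited in the third inequality. The last inequality is concluded from the geometric series sum formula for large enough $p$.

Then, Lemma 9 can be applied, concluding that $\mathrm{krank}(A) \ge r = cp$, with probability at least $1 - \gamma_2 p^{-\beta'}$ for constants $\beta' = \beta \log \frac{1}{c} - (n-1) > 0$ and $\gamma_2 > 0$ as

$$\gamma_2 = \frac{c^{n-1} e^2}{n^n (1 - \delta_2)},$$

where $\delta_2$ is a constant satisfying $c' e^2 p^{-\beta'} < \delta_2 < 1$.

□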
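The two ingredients of this proof, the binomial ratio bound used in (39) and the resulting union bound over set sizes, can be evaluated exactly for small parameters. The sketch below is illustrative only; the function names and the parameter values (p = 50, q = 100, r = 5, d_min = 10) are assumptions chosen for the example and are not taken from the theorem.

```python
from fractions import Fraction
from math import comb

def ratio_bound_holds(t, p, d):
    """Check C(t, d) / C(p, d) <= (t / p)^d exactly, for t <= p."""
    return Fraction(comb(t, d), comb(p, d)) <= Fraction(t, p) ** d

def expansion_failure_bound(p, q, r, d_min):
    """Union bound sum_{s=1}^{r} C(q, s) C(p, s) (s / p)^(d_min * s), as a float."""
    total = sum(
        comb(q, s) * comb(p, s) * Fraction(s, p) ** (d_min * s)
        for s in range(1, r + 1)
    )
    return float(total)

# Overcomplete toy setting: q = 100 columns, p = 50 rows.
bound = expansion_failure_bound(p=50, q=100, r=5, d_min=10)
print(bound)
```

The ratio bound holds because $\binom{t}{d} / \binom{p}{d} = \prod_{k<d} \frac{t-k}{p-k}$ and each factor is at most $t/p$ when $t \le p$. In this toy setting the sum is dominated by the $s = 1$ term, which matches the geometric-series step at the end of the proof.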


Proof of Remark 11: Consider a random bipartite graph $G(Y, X; E)$ where for each node $i \in Y$:

1. Neighbors $N(i) \subseteq X$ are picked uniformly at random among all size $d$ subsets of $X$.

2. Matching $M(i) \subseteq N(i)$ is picked uniformly at random among all size $n$ subsets of $N(i)$.

Note that as long as $n \le d$, the distribution of $M(i)$ is uniform over all size $n$ subsets of $X$. Fix some pair $i, i' \in Y$. Then

$$\Pr\big( M(i) = M(i') \big) = \binom{|X|}{n}^{-1}.$$

By the union bound,

$$\Pr\big( \exists\, i, i' \in Y,\ i \neq i' \text{ s.t. } M(i) = M(i') \big) \le \binom{|Y|}{2} \binom{|X|}{n}^{-1},$$

which is $\Theta(|Y|^2 / |X|^n)$ when $n$ is constant. Therefore, if $d \ge n$ and the size constraint $|Y| = O(|X|^s)$ for some $s < \frac{n}{2}$ is satisfied, then whp, there is no pair of nodes in set $Y$ with the same random n-gram matching. This concludes that the random bipartite graph has a perfect n-gram matching whp, under these size and degree conditions.

□
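The union bound in this argument is a birthday-problem style estimate and is easy to evaluate. The sketch below simply computes $\binom{|Y|}{2} / \binom{|X|}{n}$; the function name and the sample sizes are hypothetical and chosen only to show how the bound shrinks as $|X|$ grows with $|Y| = |X|^s$ for some $s < n/2$.

```python
from math import comb

def collision_bound(num_y, num_x, n):
    """Union bound on the probability that two nodes of Y draw the same n-subset of X."""
    return comb(num_y, 2) / comb(num_x, n)

# With n = 4 and |Y| = |X|^1.5 (so s = 1.5 < n / 2), the bound decreases as |X| grows.
for num_x in (100, 400, 1600):
    num_y = int(num_x ** 1.5)
    print(num_x, num_y, collision_bound(num_y, num_x, n=4))
```

Note that the bound can exceed one when $|Y|$ is too large relative to $|X|^{n/2}$, in which case it carries no information; this is the reason for the size constraint in the remark.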


B.2  (Auxiliary)  lemmata 

Lemma 7 (Existence of perfect matching for random bipartite graphs). Consider a random bipartite graph $G(W, Z; E)$ with $|W| = w$ nodes on the left side and $|Z| = z$ nodes on the right side, where each node $i \in W$ is randomly connected to $d_i$ different nodes in set $Z$. Let $d_w := \min_{i \in W} d_i$. Assume that it satisfies the size condition $w \le cz$ for some constant $0 < c < 1$ and the degree condition $d_w \ge 1 + \beta \log z$ for some constant $\beta > 0$. Then, there exists a perfect matching in the random bipartite graph $G(W, Z; E)$ with probability at least $1 - O(z^{-\beta \log 1/c})$, where $\beta \log \frac{1}{c} > 0$.

Proof: From Hall's theorem (Theorem 6), the existence of a perfect matching for a bipartite graph is equivalent to the occurrence of the following event

$$\mathcal{E} := \text{“} \forall S \subseteq W : |N(S)| \ge |S| \text{”}.$$

Similar to the analysis in the proof of Theorem 4, it is concluded from the union bound that

$$\begin{aligned}
\Pr(\mathcal{E}^c) = \Pr\big( \exists S \subseteq W \text{ s.t. } |N(S)| < |S| \big) &\le \sum_{s=1}^{w} \binom{w}{s} \binom{z}{s} \Big( \frac{s}{z} \Big)^{d_w s} \\
&\le \sum_{s=1}^{w} \Big( e\frac{w}{s} \Big)^{s} \Big( e\frac{z}{s} \Big)^{s} \Big( \frac{s}{z} \Big)^{d_w s} \\
&= \sum_{s=1}^{w} \Big( \frac{e^2 w\, s^{d_w - 2}}{z^{d_w - 1}} \Big)^{s},
\end{aligned}$$

where the bound $\binom{n}{k} \le (e\frac{n}{k})^k$ is used in the second inequality. From the assumed lower bound on the degree $d_w$, the size condition $s \le w \le cz$, and the fact that $0 < c < 1$, we have

$$\Pr(\mathcal{E}^c) \le \sum_{s=1}^{w} \big( e^2 z^{-\beta \log 1/c} \big)^{s} \le \frac{e^2}{1 - \delta_1}\, z^{-\beta \log 1/c},$$

where the second inequality is concluded from the geometric series sum formula for large enough $z$, and $\delta_1$ is a constant satisfying $e^2 z^{-\beta \log 1/c} < \delta_1 < 1$. □
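A quick simulation can make Lemma 7 concrete. The sketch below is not from the paper: it samples a random bipartite graph in which every left node picks d distinct right neighbors, and then tests for a matching that covers the left side using the standard augmenting-path (Kuhn's) algorithm in place of checking Hall's condition directly. The parameter values z = 60, c = 0.5 and beta = 2 are arbitrary choices for the example.

```python
import math
import random

def has_perfect_matching(adj, z):
    """Kuhn's augmenting-path algorithm; adj[i] lists the right-neighbors of left node i."""
    match_right = [-1] * z  # match_right[k] = left node currently matched to right node k

    def try_augment(i, seen):
        for k in adj[i]:
            if k in seen:
                continue
            seen.add(k)
            if match_right[k] == -1 or try_augment(match_right[k], seen):
                match_right[k] = i
                return True
        return False

    # Every left node must be matched for the matching to be perfect.
    return all(try_augment(i, set()) for i in range(len(adj)))

def random_left_regular_graph(w, z, d, rng):
    """Each of the w left nodes picks d distinct right neighbors uniformly at random."""
    return [rng.sample(range(z), d) for _ in range(w)]

rng = random.Random(0)
z, c, beta = 60, 0.5, 2.0
w = int(c * z)                               # size condition w <= c z
d = 1 + math.ceil(beta * math.log(z))        # degree condition d >= 1 + beta log z
found = has_perfect_matching(random_left_regular_graph(w, z, d, rng), z)
print(found)
```

By Hall's theorem the matching exists exactly when the event in the proof holds, so repeating the experiment over many seeds would give an empirical estimate of the failure probability that the lemma bounds.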

Lemma 8 (Degree concentration bound). Consider a random bipartite graph $G(Y, X; E)$ with $|Y| = q$ and $|X| = p$, where each node $i \in Y$ is randomly connected to $d_i$ different nodes in set $X$. Let $Y' \subseteq Y$ be any subset$^{18}$ of nodes in $Y$ with size $|Y'| = q'$ and $X' \subseteq X$ be a random (uniformly chosen) subset of nodes in $X$ with size $|X'| = p'$. Create the new bipartite graph $G(Y', X'; E')$ where the edge set $E' \subseteq E$ is the subset of edges in $E$ incident to $Y'$ and $X'$. Denote the degree of each node $i \in Y'$ within this new bipartite graph by $d'_i$. Let $d_{\min} := \min_{i \in Y} d_i$ and $d'_{\min} := \min_{i \in Y'} d'_i$. Then, if $d_{\min} > \frac{p}{p'} r$ for a non-negative integer $r$, we have

$$\Pr[d'_{\min} \ge r + 1] \ge 1 - q' \exp\left(-2 (p'/p)^2 \frac{(d_{\min} - (p/p') r)^2}{d_{\min}}\right).$$


$^{18}$ Note that $Y'$ need not be uniformly chosen and the result is valid for any subset of nodes $Y' \subseteq Y$.

Proof: For any $i \in Y'$, we have

$$\Pr[d'_i \le r] = \sum_{j=0}^{r} \frac{\binom{p'}{j} \binom{p - p'}{d_i - j}}{\binom{p}{d_i}},$$

where the inner term of the summation is the probability mass function of a hypergeometric distribution with parameters $p$ (population size), $p'$ (number of success states in the population) and $d_i$ (number of draws), and $j$ is the hypergeometric random variable denoting the number of successes. The following tail bound for the hypergeometric distribution is provided in [45, 46]:

$$\Pr[d'_i \le r] \le \exp(-2 t_i^2 d_i),$$

for $t_i \ge 0$ given by $r = \left(\frac{p'}{p} - t_i\right) d_i$. Note that the assumption $d_{\min} > \frac{p}{p'} r$ in the lemma is equivalent to having $t_i > 0$ for all $i \in Y$. Considering the minimum degree, for any $i \in Y'$, we have

$$\Pr[d'_i \le r] \le \exp(-2 t^2 d_{\min}),$$

for $t > 0$ given by $r = \left(\frac{p'}{p} - t\right) d_{\min}$. Substituting $t$ from this equation gives the following bound:

$$\Pr[d'_i \le r] \le \exp\left(-2 (p'/p)^2 \frac{(d_{\min} - (p/p') r)^2}{d_{\min}}\right). \qquad (40)$$

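Finally, applying the union bound, we can prove the result as follows:

$$\begin{aligned}
\Pr[d'_{\min} \ge r + 1] &= \Pr\Big[\bigcap_{i=1}^{q'} \{d'_i \ge r + 1\}\Big] \\
&\ge 1 - \sum_{i=1}^{q'} \Pr[d'_i \le r] \\
&\ge 1 - \sum_{i=1}^{q'} \exp\left(-2 (p'/p)^2 \frac{(d_{\min} - (p/p') r)^2}{d_{\min}}\right) \\
&= 1 - q' \exp\left(-2 (p'/p)^2 \frac{(d_{\min} - (p/p') r)^2}{d_{\min}}\right),
\end{aligned}$$

where the union bound is applied in the first inequality and the second inequality is concluded from (40). □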
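The tail bound used in this proof can be checked numerically. The sketch below is only an illustration with hypothetical parameter values: it computes the exact hypergeometric probability $\Pr[d'_i \le r]$ with integer binomials, compares it with $\exp(-2 t^2 d)$ for $t = p'/p - r/d$, and evaluates the right-hand side of the lemma. The function names are assumptions introduced here.

```python
import math

def hypergeom_cdf(r, p, p_prime, d):
    """Exact Pr[d' <= r] for d draws without replacement from p items, p_prime of them marked."""
    total = math.comb(p, d)
    favorable = sum(
        math.comb(p_prime, j) * math.comb(p - p_prime, d - j)
        for j in range(0, min(r, d) + 1)
    )
    return favorable / total

def tail_bound(r, p, p_prime, d):
    """exp(-2 t^2 d) with t = p'/p - r/d; requires t >= 0, i.e. d >= (p/p') r."""
    t = p_prime / p - r / d
    if t < 0:
        raise ValueError("the bound requires d >= (p/p') * r")
    return math.exp(-2 * t * t * d)

def lemma_lower_bound(r, p, p_prime, d_min, q_prime):
    """1 - q' exp(-2 (p'/p)^2 (d_min - (p/p') r)^2 / d_min)."""
    ratio = p_prime / p
    return 1 - q_prime * math.exp(-2 * ratio ** 2 * (d_min - r / ratio) ** 2 / d_min)

# Hypothetical values: p = 100 nodes in X, p' = 50 kept, every degree equal to 20.
p, p_prime, d, q_prime, r = 100, 50, 20, 30, 3
print("exact tail :", hypergeom_cdf(r, p, p_prime, d))
print("tail bound :", tail_bound(r, p, p_prime, d))
print("lemma bound:", lemma_lower_bound(r, p, p_prime, d, q_prime))
```

Writing the exponent as $t^2 d_{\min}$ or as $(p'/p)^2 (d_{\min} - (p/p') r)^2 / d_{\min}$ gives the same number, which is the substitution carried out in (40).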

A  lower  bound  on  the  Kruskal  rank  of  matrix  A  based  on  a  sufficient  relaxed  expansion  property 
on  A  is  provided  in  the  following  lemma. 

Lemma 9. If $A$ is generic and the bipartite graph $G(Y, X; A)$ satisfies the relaxed$^{19}$ expansion property $|N(S)| \ge |S|$ for any subset $S \subseteq Y$ with $|S| \le r$, then $\mathrm{krank}(A) \ge r$, almost surely.

Before presenting the proof, we state Hall's (marriage) theorem, which gives an equivalent condition for having a perfect matching in a bipartite graph.

Theorem 6 (Hall's theorem, [47]). A bipartite graph $G(Y, X; E)$ has a $Y$-saturating matching if and only if for every subset $S \subseteq Y$, the neighborhood of $S$ is at least as large as $S$, i.e., $|N(S)| \ge |S|$.

Proof of Lemma 9: Denote the submatrix $A_{N(S),S}$ by $A_S$, i.e., $A_S := A_{N(S),S}$. By Hall's (marriage) theorem, the bipartite graph $G(S, N(S); A_S)$ has a perfect matching $M_S$ for any subset $S \subseteq Y$ such that $|S| \le r$. Denote by $A_{M_S}$ the matrix corresponding to this perfect matching edge set $M_S$, i.e., $A_{M_S}$ keeps the non-zero entries of $A_S$ on the edge set $M_S$ and is zero everywhere else. Note that the support of $A_{M_S}$ is within the support of $A_S$. According to the definition of perfect matching, the matrix $A_{M_S}$ is full column rank. From Lemma El it is concluded that $A_S$ is also full column rank almost surely. This is true for any $A_S$ with $S \subseteq Y$ and $|S| \le r$, which directly implies that $\mathrm{krank}(A) \ge r$, almost surely. □

$^{19}$ There is no $d_{\max}$ term, in contrast to the expansion property proposed in condition 7.
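To make the relation between the expansion property and the Kruskal rank concrete, the sketch below checks both on a small example. It is an illustration only: the matrix is a fixed integer matrix standing in for a generic one (an assumption, since genericity is a probabilistic notion), matrices are passed as lists of columns, and the function names are introduced here. Rank is computed exactly with rational arithmetic.

```python
from fractions import Fraction
from itertools import combinations

def rank(cols):
    """Rank of a matrix given as a list of columns, via exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in zip(*cols)]
    rk = 0
    for c in range(len(cols)):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(len(rows)):
            if i != rk and rows[i][c] != 0:
                factor = rows[i][c] / rows[rk][c]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def krank(cols):
    """Largest k such that every set of k columns is linearly independent."""
    k = 0
    for size in range(1, len(cols) + 1):
        if all(rank(list(subset)) == size for subset in combinations(cols, size)):
            k = size
        else:
            break
    return k

def expansion_holds(cols, r):
    """Relaxed expansion: |N(S)| >= |S| for every set S of at most r columns."""
    supports = [{i for i, v in enumerate(col) if v != 0} for col in cols]
    for size in range(1, min(r, len(cols)) + 1):
        for subset in combinations(supports, size):
            if len(set().union(*subset)) < size:
                return False
    return True

# Four sparse columns in R^3 (hypothetical values).
A_cols = [[1, 2, 0], [0, 3, 5], [7, 0, 11], [13, 17, 19]]
print(expansion_holds(A_cols, 3), krank(A_cols))
```

When the expansion property fails, for instance with two columns supported on the same single row, the two columns are proportional and the Kruskal rank drops to 1.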

Finally, Theorem 2 is proved by exploiting the random results on the existence of a perfect $n$-gram matching and on the Kruskal rank, provided in Theorems 3 and 4.

Proof of Theorem 2: We claim that if random conditions 4 and 5 are satisfied, then deterministic conditions 2 and 3 hold whp. Then Theorem 1 can be applied and the proof is done.

From the size and degree conditions, Theorem 3 can be applied, which implies that the perfect $n$-gram matching condition 2 is satisfied with probability at least $1 - \gamma_1 p^{-\beta'}$ for $\beta' = \beta \log(1/c) - (n - 1) > 0$. The conditions required for Theorem 4 also hold, and by applying this theorem we have the bound $\mathrm{krank}(A) \ge cp$ with probability at least $1 - \gamma_2 p^{-\beta'}$. Combining this inequality with the upper bound on the degree $d$ in condition 5, we conclude that krank condition 3 is also satisfied whp. Hence, all the conditions required for Theorem 1 are satisfied with probability at least $1 - \gamma p^{-\beta'}$, where


$$\gamma = \gamma_1 + \gamma_2,$$

and this completes the proof. □


C  Relationship  to  CP  Decomposition  Uniqueness  Results 

In this section, we provide a more detailed comparison with some uniqueness results for the overcomplete CP decomposition. Here, the following CP decomposition of the third order tensor $T \in \mathbb{R}^{p \times s \times q}$ is considered:

$$T = \sum_{i=1}^{r} a_i \circ b_i \circ c_i, \qquad (41)$$

where $A = [a_1 | \dots | a_r] \in \mathbb{R}^{p \times r}$, $B = [b_1 | \dots | b_r] \in \mathbb{R}^{s \times r}$ and $C = [c_1 | \dots | c_r] \in \mathbb{R}^{q \times r}$.

The  most  important  and  general  uniqueness  result  of  CP,  called  Kruskal’s  condition,  is  provided 
in  m,  where  it  is  guaranteed  that  the  above  CP  decomposition  is  unique  if 

$$\mathrm{krank}(A) + \mathrm{krank}(B) + \mathrm{krank}(C) \ge 2r + 2.$$

Since then, several works have analyzed the uniqueness of the CP decomposition. One set of works assumes that one of the components, say $C$, is full column rank [17, 18]. It is shown in [18] that, for generic (fully dense) components $A$, $B$ and $C$, if $r \le q$ and $r(r - 1) \le p(p - 1)s(s - 1)/2$, then the CP decomposition in (41) is generically unique.
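The two sufficient conditions quoted above can be compared on a small worked example. The sketch below is illustrative: the function names and the dimensions are assumptions, and it uses the standard fact that a generic dense matrix has Kruskal rank equal to the smaller of its two dimensions.

```python
def kruskal_condition(krank_a, krank_b, krank_c, r):
    """Kruskal's sufficient condition: krank(A) + krank(B) + krank(C) >= 2r + 2."""
    return krank_a + krank_b + krank_c >= 2 * r + 2

def full_column_rank_size_condition(p, s, q, r):
    """Size condition quoted from [18]: r <= q and r(r-1) <= p(p-1)s(s-1)/2."""
    return r <= q and 2 * r * (r - 1) <= p * (p - 1) * s * (s - 1)

# Hypothetical dimensions: A and B are 10 x 40 (overcomplete), C is 40 x 40.
p, s, q, r = 10, 10, 40, 40
generic_kranks = (min(p, r), min(s, r), min(q, r))  # generic dense factors

print(kruskal_condition(*generic_kranks, r))          # 10 + 10 + 40 = 60 < 82
print(full_column_rank_size_condition(p, s, q, r))    # 40 <= 40 and 1560 <= 4050
```

In this example Kruskal's condition is not met while the size condition of [18] is, which is the overcomplete regime (in $A$ and $B$) that the comparison in this section is concerned with.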

Now, we demonstrate how this CP uniqueness result can be adapted to our setting. First, consider the matrix $M \in \mathbb{R}^{ps \times q}$ which is obtained by stacking the entries of $T$ as

$$M_{(i-1)s+j,\,k} = T_{ijk}.$$

Then, we have

$$M = (A \odot B) C^\top. \qquad (42)$$
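The unfolding identity (42) can be verified directly on small integer matrices. The sketch below is an illustration with hypothetical entries and function names; it builds $T$ from its CP factors, stacks it with the row index $(i-1)s + j$ (written 0-based as `i * s + j`), and compares the result with the product of the Khatri-Rao (column-wise Kronecker) product of $A$ and $B$ with $C^\top$.

```python
def khatri_rao(A, B):
    """Column-wise Kronecker product of A (p x r) and B (s x r): a (p*s) x r matrix."""
    p, s, r = len(A), len(B), len(A[0])
    return [[A[i][k] * B[j][k] for k in range(r)] for i in range(p) for j in range(s)]

def cp_tensor(A, B, C):
    """T[i][j][k] = sum_l A[i][l] * B[j][l] * C[k][l], as in (41)."""
    p, s, q, r = len(A), len(B), len(C), len(A[0])
    return [[[sum(A[i][l] * B[j][l] * C[k][l] for l in range(r)) for k in range(q)]
             for j in range(s)] for i in range(p)]

def unfold(T):
    """Stack T into a (p*s) x q matrix whose row i*s + j is T[i][j][:]."""
    return [T[i][j] for i in range(len(T)) for j in range(len(T[0]))]

def times_transpose(X, C):
    """Compute X C^T for X (n x r) and C (q x r)."""
    return [[sum(x[l] * c[l] for l in range(len(x))) for c in C] for x in X]

# Hypothetical factors with p = 2, s = 2, q = 3, r = 3.
A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0, 2], [3, 1, 1]]
C = [[2, 1, 0], [1, 1, 1], [0, 3, 2]]

M = unfold(cp_tensor(A, B, C))
print(M == times_transpose(khatri_rao(A, B), C))
```

Integer entries are used so that the comparison is exact; with floating-point factors a tolerance would be needed.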

On the other hand, for the 2-persistent topic model with 4 words ($n = 2$, $m = 2$), the moment can be written as

$$M_4^{(2)}(x) = (A \odot A)\, \mathbb{E}[h h^\top]\, (A \odot A)^\top,$$

for $A \in \mathbb{R}^{p \times q}$. The following matrix has the same column span as $M_4^{(2)}(x)$:

$$M' = (A \odot A) C'^\top,$$

for some full rank matrix $C' \in \mathbb{R}^{q \times q}$. Our random identifiability result in Theorem 2 provides the uniqueness of $A$ and $C'$, given $M'$, under the size condition $q \le (c\,p/n)^n$ and the additional degree condition 5. Note that, as discussed in the previous section, this identifiability argument is the same as the unique decomposition of the corresponding tensor.

Thus, in equation (42), by setting $A = B$ and taking $C$ to be a full rank square matrix, we obtain the 2-persistent topic model under consideration in this paper. Consequently, the identifiability results of [18] are applicable
to  our  setting,  if  we  assume  generic  (i.e.  fully  dense)  matrix  A.  However,  we  incorporate  a  sparse 
matrix  A,  and  therefore,  require  different  techniques  to  provide  identifiability  results.  We  note  that 
the  size  bound  specified  in  [18]  is  comparable  to  the  size  bound  derived  in  this  paper  (for  random 
structured  matrices),  but  we  have  additional  degree  considerations  for  identifiability.  Analyzing 
the  regime  where  the  uniqueness  conditions  of  m  are  satisfied  under  sparsity  constraints  is  an 
interesting  question  for  future  investigation. 